Paperid: 1, https://arxiv.org/pdf/2504.21776.pdf
Authors:Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou
Title: WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Abstract:
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
中文: WebThinker是一种深度研究智能体,通过赋予大型推理模型自主网络搜索和实时报告撰写能力,在复杂推理基准测试中显著超越了现有方法。
English: WebThinker is a deep research agent that enhances large reasoning models by enabling autonomous web searching and real-time report drafting, significantly outperforming existing methods on complex reasoning benchmarks.
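
The abstract names Direct Preference Optimization (DPO) as the training objective. As a rough illustration, the standard per-pair DPO loss can be sketched as below; this is the generic formula, not WebThinker's training code, and the function name, argument names, and the default beta are assumptions made for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the
    # reference does, relative to the rejected response.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and the reference agree, and falls as the policy shifts probability toward the chosen response.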

Authors:Jiuwu Hao, Liguo Sun, Yuting Wan, Yueyang Wu, Ti Xiang, Haolin Song, Pin Lv
Title: Is Intermediate Fusion All You Need for UAV-based Collaborative Perception?
Abstract:
Collaborative perception enhances environmental awareness through inter-agent communication and is regarded as a promising solution to intelligent transportation systems. However, existing collaborative methods for Unmanned Aerial Vehicles (UAVs) overlook the unique characteristics of the UAV perspective, resulting in substantial communication overhead. To address this issue, we propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF. The core concept is to exchange informative and compact detection results and shift the fusion stage to the feature representation level. In particular, we leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents. Additionally, we innovatively introduce an uncertainty-driven communication mechanism that uses uncertainty evaluation to select high-quality and reliable shared areas. Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality. Code and models are available at https://github.com/uestchjw/LIF.
中文: 本文提出LIF框架,通过后期中间融合和不确定性驱动机制,实现无人机间高效通信的协同感知,在保证性能的同时大幅降低通信开销。
English: This paper introduces LIF, a communication-efficient collaborative perception framework for UAVs that uses late-intermediate fusion and an uncertainty-driven mechanism to minimize bandwidth while maintaining high performance.

Authors:Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, Yi R. Fung
Title: MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Abstract:
The hallucination of non-existent facts by LLMs is an important problem given their widespread adoption across various applications. Previous research addresses this problem by analyzing the internal parameterized knowledge boundaries to estimate confidence. However, these studies focus on the single-problem setting and have not explored the more challenging multi-problem setting, which requires accurately answering multiple questions simultaneously. We introduce a novel method for the multi-problem setting, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
中文: 本文提出MAC-Tuning新方法,通过在指令微调中分离答案预测与置信度估计来解决多问题场景下的大语言模型幻觉问题,实验表明其平均精度比基线方法最高提升25%。
English: This paper introduces MAC-Tuning, a novel method that separates answer prediction and confidence estimation during fine-tuning to address LLM hallucination in multi-problem settings, achieving up to 25% higher average precision than baselines.

Authors:Bahram Jafrasteh, Wei Peng, Cheng Wan, Yimin Luo, Ehsan Adeli, Qingyu Zhao
Title: WASABI: A Metric for Evaluating Morphometric Plausibility of Synthetic Brain MRIs
Abstract:
Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brain MRIs. WASABI leverages SynthSeg, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying anatomical discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing anatomical fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at https://github.com/BahramJafrasteh/wasabi-mri.
中文摘要:本研究提出名为WASABI的新指标,通过结合脑区分割与Wasserstein距离来评估合成脑部MRI的解剖学真实性,实验表明即使合成图像视觉质量近乎完美,该指标仍能比传统方法更敏感地检测解剖差异。
English Summary: This study introduces WASABI, a novel metric using Wasserstein distance and brain parcellation to evaluate anatomical realism in synthetic MRIs, demonstrating superior sensitivity to anatomical discrepancies compared to traditional metrics despite high visual quality.
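
To make the idea of comparing volume distributions concrete, here is a minimal sketch of the empirical Wasserstein-1 distance for a single brain region. It is a one-dimensional simplification and not the multivariate distance the abstract describes; the function name and the sample volumes in the usage note are made up for illustration.

```python
def wasserstein_1d(real, synthetic):
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized samples the optimal transport plan pairs the i-th
    smallest value of one sample with the i-th smallest of the other.
    """
    if not real or len(real) != len(synthetic):
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(real), sorted(synthetic)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

For example, `wasserstein_1d([3.1, 3.4, 3.0], [3.6, 3.9, 3.5])` is about 0.5: every hypothetical synthetic volume sits 0.5 units above its matched real counterpart.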

Authors:Jonas Werner, Kun Chu, Cornelius Weber, Stefan Wermter
Title: LLM-based Interactive Imitation Learning for Robotic Manipulation
Abstract:
Recent advancements in machine learning provide methods to train autonomous agents capable of handling the increasing complexity of sequential decision-making in robotics. Imitation Learning (IL) is a prominent approach, where agents learn to control robots based on human demonstrations. However, IL commonly violates the independent and identically distributed (i.i.d.) assumption in robotic tasks. Interactive Imitation Learning (IIL) achieves improved performance by allowing agents to learn from interactive feedback from human teachers. Despite these improvements, both approaches come with significant costs due to the necessity of human involvement. Leveraging the emergent capabilities of Large Language Models (LLMs) in reasoning and generating human-like responses, we introduce LLM-iTeach, a novel IIL framework that utilizes an LLM as an interactive teacher to enhance agent performance while alleviating the dependence on human resources. First, LLM-iTeach uses a hierarchical prompting strategy that guides the LLM in generating a policy in Python code. Then, with a designed similarity-based feedback mechanism, LLM-iTeach provides corrective and evaluative feedback interactively during the agent's training. We evaluate LLM-iTeach against baseline methods such as Behavior Cloning (BC), an IL method, and CEILing, a state-of-the-art IIL method using a human teacher, on various robotic manipulation tasks. Our results demonstrate that LLM-iTeach surpasses BC in success rate and matches or even exceeds that of CEILing, highlighting the potential of LLMs as cost-effective, human-like teachers in interactive learning environments. We further demonstrate the method's potential for generalization by evaluating it on additional tasks. The code and prompts are provided at: https://github.com/Tubicor/LLM-iTeach.
中文: LLM-iTeach是一种创新的交互式模仿学习框架,利用大型语言模型作为交互式教师,通过生成策略和提供反馈来提升机器人智能体的性能,同时减少对人类资源的依赖,其效果达到甚至超越了人类教师指导的方法。
English: LLM-iTeach is a novel Interactive Imitation Learning framework that employs a Large Language Model as an interactive teacher, generating policies and providing feedback to enhance robotic agent performance while reducing reliance on human resources, achieving results comparable to or better than human-taught methods.

Authors:Ting Qiao, Yingjia Wang, Xing Liu, Sixing Wu, Jianbing Li, Yiming Li
Title: Cert-SSB: Toward Certified Sample-Specific Backdoor Defense
Abstract:
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors into the model. The compromised model behaves normally on clean samples but misclassifies backdoored samples into the attacker-specified target class, posing a significant threat to real-world DNN applications. Currently, several empirical defense methods have been proposed to mitigate backdoor attacks, but they are often bypassed by more advanced backdoor techniques. In contrast, certified defenses based on randomized smoothing have shown promise by adding random noise to training and testing samples to counteract backdoor attacks. In this paper, we reveal that existing randomized smoothing defenses implicitly assume that all samples are equidistant from the decision boundary. However, this assumption may not hold in practice, leading to suboptimal certification performance. To address this issue, we propose a sample-specific certified backdoor defense method, termed Cert-SSB. Cert-SSB first employs stochastic gradient ascent to optimize the noise magnitude for each sample, ensuring a sample-specific noise level that is then applied to multiple poisoned training sets to retrain several smoothed models. After that, Cert-SSB aggregates the predictions of multiple smoothed models to generate the final robust prediction. In this setting, existing certification methods become inapplicable since the optimized noise varies across different samples. To overcome this challenge, we introduce a storage-update-based certification method, which dynamically adjusts each sample's certification region to improve certification performance. We conduct extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of our proposed method. Our code is available at https://github.com/NcepuQiaoTing/Cert-SSB.
中文: 深度神经网络易受后门攻击,现有基于随机平滑的认证防御方法假设所有样本与决策边界等距,导致性能不佳;本文提出Cert-SSB方法,通过为每个样本优化噪声水平并采用动态认证机制,有效提升了防御能力。
English: Deep neural networks are susceptible to backdoor attacks, but existing certified defenses using randomized smoothing assume uniform sample distances from decision boundaries, leading to suboptimal performance; this paper introduces Cert-SSB, a sample-specific defense that optimizes noise levels per sample and employs a dynamic certification method to enhance robustness.
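
The aggregation step can be pictured as a majority vote over several smoothed models, each seeing the test sample under noise of a sample-specific magnitude. The sketch below is a simplified illustration of that voting idea only: it does not optimize the noise level, retrain models, or certify anything, and `smoothed_predict`, its arguments, and the toy classifier in the usage note are all hypothetical.

```python
import random
from collections import Counter

def smoothed_predict(models, x, sigma, seed=0):
    """Majority vote of several classifiers on Gaussian-noised copies of x.

    models: list of callables mapping a feature list to a class label.
    sigma:  noise standard deviation chosen for this particular sample.
    Returns the winning label and the fraction of models that voted for it.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    votes = Counter()
    for model in models:
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        votes[model(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / len(models)
```

With a toy classifier such as `lambda z: int(sum(z) > 0)` repeated five times, a sample far from the boundary like `[5.0, 5.0]` keeps its label under small noise, which is the intuition behind allowing larger noise for samples farther from the decision boundary.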

Authors:Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze
Title: LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics
Abstract:
We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.
中文摘要:本研究提出了一种基于大语言模型驱动代理编排架构的具身机器人系统,通过集成记忆增强任务规划和三个专业代理,无需显式模型训练即可实现自主家居物品管理,在多种家庭场景中展现出高任务规划精度和增强的记忆召回能力。
English Summary: This study introduces an embodied robotic system using an LLM-driven agent-orchestration architecture for autonomous household object management, integrating memory-augmented task planning and specialized agents to achieve high task accuracy and improved memory recall without explicit model training.

Authors:Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe
Title: Visual Text Processing: A Comprehensive Review and Unified Evaluation
Abstract:
Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.
Chinese: 本综述全面分析了视觉文本处理的最新进展,探讨了文本特征的选择与融合关键问题,并引入了新基准和评估指标,旨在为该领域的未来探索与创新提供基础资源。
English: This survey provides a comprehensive analysis of recent advances in visual text processing, addressing key questions about textual features and their integration into frameworks, while introducing a new benchmark and evaluation metric to guide future innovations in the field.

Authors:Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen
Title: Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
Abstract:
Recently, long-thought reasoning models have achieved strong performance on complex reasoning tasks, but they often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method (Ada-R1) significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1
中文摘要:提出的Ada-R1框架通过混合模型集成和双层偏好训练自适应选择推理深度,在数学数据集上保持性能的同时将推理长度缩减超50%。
English Summary: The proposed Ada-R1 framework adaptively selects reasoning depth through hybrid model integration and bi-level training, cutting reasoning length by over 50% while maintaining performance across mathematical datasets.
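
The abstract does not say how the long and short CoT models are merged. One common choice, assumed here purely for illustration, is linear interpolation of matching parameters; `merge_weights`, the dict-of-lists weight format, and `alpha` are hypothetical stand-ins for real checkpoints.

```python
def merge_weights(long_cot, short_cot, alpha=0.5):
    """Linearly interpolate two models' parameters, name by name.

    long_cot, short_cot: dicts mapping parameter names to flat lists of floats.
    alpha: weight given to the long-CoT model (1 - alpha goes to the short one).
    """
    if long_cot.keys() != short_cot.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in long_cot:
        a, b = long_cot[name], short_cot[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged
```

Setting `alpha` to 1.0 recovers the long-CoT weights and 0.0 recovers the short-CoT weights; values in between give a model that can, in principle, be trained further to pick either reasoning style.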

Authors:Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
Title: HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
Abstract:
The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
中文: HoloTime框架通过集成视频扩散模型生成全景视频并重建4D场景,能将单个提示或图像转化为时空一致的沉浸式环境,为VR/AR应用提供完整的4D体验解决方案。
English: The HoloTime framework integrates video diffusion models to generate panoramic videos and reconstruct 4D scenes, enabling fully immersive VR/AR experiences by transforming single prompts or images into spatially and temporally consistent environments.

Authors:Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Yibin Shen, Zhe Tang, Yilin Jin, ZongYu Wang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Title: Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs' Instruction Following Capability
Abstract:
The capability to precisely adhere to instructions is a cornerstone for Large Language Models (LLMs) to function as dependable agents in real-world scenarios. However, confronted with complex prompts, LLMs frequently encounter difficulties in fulfilling all specified requirements within a single response. Drawing inspiration from recent advancements in Chain-of-Thought (CoT) prompting and self-correction methodologies, we introduce Meeseeks (The name is inspired by Mr. Meeseeks from "Rick and Morty," a character renowned for efficiently accomplishing assigned tasks. See: https://en.wikipedia.org/wiki/Mr._Meeseeks), a fully automated iterative instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks identifies erroneous components in model responses and provides corresponding feedback accurately, thereby iteratively guiding the model toward self-correction. The dataset contains over 700 curated instances annotated by 32 distinct capability tags in Chinese and English. Extensive experimental results reveal that different state-of-the-art commercial and open-source LLMs exhibit vastly disparate performance, and even after 20 turns of iterative feedback-driven self-correction, nearly all models demonstrate suboptimal performance. We conducted comprehensive analysis from both macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models, as well as several counterintuitive phenomena. We've open-sourced our work on https://github.com/ADoublLEN/Meeseeks.
中文: Meeseeks基准测试是一个自动化系统，能识别大语言模型响应中的错误并提供迭代反馈以引导自我修正，但即便经过多轮修正，大多数模型仍表现欠佳。
English: The Meeseeks benchmark is an automated system that identifies errors in LLM responses and provides iterative feedback to guide self-correction, yet most models still underperform even after multiple rounds.
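
The feedback loop described in the abstract can be sketched as follows. This is an assumed, minimal shape for such a loop rather than the benchmark's actual harness: `run_self_correction`, the `model(prompt, previous, failed)` signature, and the named checker functions are hypothetical.

```python
def run_self_correction(model, checkers, prompt, max_turns=20):
    """Ask the model repeatedly, feeding back the names of failed requirements.

    model:    callable(prompt, previous_response, failed_names) -> new response.
    checkers: dict mapping a requirement name to callable(response) -> bool.
    Returns (final_response, turns_used, all_requirements_met).
    """
    response, failed = "", []
    for turn in range(1, max_turns + 1):
        response = model(prompt, response, failed)
        # Re-evaluate every requirement so regressions are also caught.
        failed = [name for name, check in checkers.items() if not check(response)]
        if not failed:
            return response, turn, True
    return response, max_turns, False
```

The default of 20 turns mirrors the number of feedback rounds mentioned in the abstract; a model that never satisfies every checker returns with the final flag set to `False`.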

Authors:Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu
Title: Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Abstract:
With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
Chinese: Mcity数据引擎是一个开源系统,旨在解决机器学习模型数据选择和标注的难题,特别关注智能交通系统中的稀有类别,提供从数据采集到模型部署的完整开发周期。
English: The Mcity Data Engine is an open-source system designed to address the challenge of selecting and labeling data for machine learning models, particularly focusing on rare classes in Intelligent Transportation Systems by providing a complete development cycle from data acquisition to model deployment.

Authors:Bing Wang, Ximing Li, Changchun Li, Bingrui Zhao, Bo Fu, Renchu Guan, Shengsheng Wang
Title: Robust Misinformation Detection by Visiting Potential Commonsense Conflict
Abstract:
The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), which aims to detect online misinformation automatically, has emerged as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, which express potential commonsense conflicts inferred from the difference between the extracted commonsense triplets and the golden ones inferred by the well-established commonsense reasoning tool COMET. These expressions are then attached to each article as augmentation. Any specific MD method can then be trained on those commonsense-augmented articles. In addition, we collect a novel commonsense-oriented dataset named CoMis, in which all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them across four public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC can consistently outperform the existing MD baselines.
中文: 本文提出MD-PCC,一种用于虚假信息检测的即插即用增强方法,通过比较提取的常识三元组与推理得出的标准三元组来利用潜在常识冲突,在多个数据集上持续优于现有基线。
English: This paper introduces MD-PCC, a plug-and-play augmentation method for misinformation detection that leverages potential commonsense conflicts by comparing extracted and inferred commonsense triplets, consistently outperforming existing baselines across multiple datasets.

Authors:Hannes Reichert, Benjamin Serfling, Elijah Schüssler, Kerim Turacan, Konrad Doll, Bernhard Sick
Title: Real Time Semantic Segmentation of High Resolution Automotive LiDAR Scans
Abstract:
Numerous recent works emphasize the importance of semantic segmentation of LiDAR data as a critical component in the development of driver-assistance systems and autonomous vehicles. However, many state-of-the-art methods are tested on outdated, lower-resolution LiDAR sensors and struggle with real-time constraints. This study introduces a novel semantic segmentation framework tailored for modern high-resolution LiDAR sensors that addresses both accuracy and real-time processing demands. We propose a novel LiDAR dataset collected by a cutting-edge automotive 128-layer LiDAR in urban traffic scenes. Furthermore, we propose a semantic segmentation method utilizing surface normals as strong input features. Our approach bridges the gap between cutting-edge research and practical automotive applications. Additionally, we provide a Robot Operating System (ROS2) implementation that we operate on our research vehicle. Our dataset and code are publicly available: https://github.com/kav-institute/SemanticLiDAR.
中文: 本研究针对高分辨率激光雷达数据,提出了一种利用表面法线和新城市数据集的实时语义分割框架,旨在弥合前沿研究与实际汽车应用之间的差距。
English: This study introduces a real-time semantic segmentation framework for high-resolution LiDAR data, utilizing surface normals and a new urban dataset to bridge research with practical automotive applications.
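
Surface normals are the input feature the abstract singles out. A minimal way to estimate one, assuming the scan is organized as a range image so that each point has a right and a lower neighbor, is the normalized cross product of the two neighbor offsets. The function below is an illustrative sketch, not the paper's method.

```python
import math

def surface_normal(center, right, down):
    """Unit normal at `center` from its right and lower neighbors (3-D points).

    Returns [0.0, 0.0, 0.0] when the three points are collinear or coincide,
    since no plane (and therefore no normal) is defined in that case.
    """
    u = [r - c for r, c in zip(right, center)]
    v = [d - c for d, c in zip(down, center)]
    # Cross product u x v is perpendicular to the local surface patch.
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        return [0.0, 0.0, 0.0]
    return [c / length for c in n]
```

In practice the sign of the normal is usually flipped so that it points toward the sensor, and neighbors across depth discontinuities are excluded; both steps are omitted here for brevity.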

Authors:Saima Afrin, Md Zahidul Haque, Antonio Mastropaolo
Title: A Systematic Literature Review of Parameter-Efficient Fine-Tuning for Large Code Models
Abstract:
The rise of Artificial Intelligence (AI), and particularly Large Language Models (LLMs) for code, has reshaped Software Engineering (SE) by enabling the automation of tasks such as code generation, bug detection, and repair. However, these models require significant computational resources for training and fine-tuning, posing challenges for real-world adoption in resource-constrained environments. To address this, the research community has increasingly turned to Parameter-Efficient Fine-Tuning (PEFT), a class of techniques that enables the adaptation of large models by updating only a small subset of parameters, rather than the entire model. In this Systematic Literature Review (SLR), we examine the growing application of PEFT techniques across a wide range of software engineering tasks. We analyze how these methods are used to optimize various deep learning (DL) architectures, focusing on their impact on both performance and efficiency. Our study synthesizes findings from 28 peer-reviewed papers, identifying patterns in configuration strategies and adaptation trade-offs. The outcome of this review is a comprehensive taxonomy that categorizes PEFT usage by task type, distinguishing between generative (e.g., Code Summarization) and non-generative (e.g., Code Clone Detection) scenarios. Our findings aim to inform future research and guide the practical deployment of PEFT in sustainable, AI-powered software development. Our artifacts are publicly available at https://github.com/alvi75/SLR-PEFT
中文摘要:人工智能和大语言模型的兴起通过自动化代码生成和错误检测等任务重塑了软件工程,但其高计算资源需求在资源受限环境中构成挑战,因此研究界转向参数高效微调技术,仅更新少量参数即可优化性能与效率。
English Summary: The rise of AI and LLMs has transformed software engineering by automating tasks like code generation and bug detection, but their high computational demands hinder adoption in resource-limited settings, leading to increased focus on Parameter-Efficient Fine-Tuning (PEFT) techniques that optimize performance and efficiency by updating only a small subset of parameters.

Authors:Uzair Shah, Marco Agus, Daniya Boges, Vanessa Chiappini, Mahmood Alzubaidi, Jens Schneider, Markus Hadwiger, Pierre J. Magistretti, Mowafa Househ, Corrado Calì
Title: SAM4EM: Efficient memory-based two stage prompt-free segment anything model adapter for complex 3D neuroscience electron microscopy stacks
Abstract:
We present SAM4EM, a novel approach for 3D segmentation of complex neural structures in electron microscopy (EM) data by leveraging the Segment Anything Model (SAM) alongside advanced fine-tuning strategies. Our contributions include the development of a prompt-free adapter for SAM using two stage mask decoding to automatically generate prompt embeddings, a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) for enhancing segmentation with limited annotated data, and a 3D memory attention mechanism to ensure segmentation consistency across 3D stacks. We further release a unique benchmark dataset for the segmentation of astrocytic processes and synapses. We evaluated our method on challenging neuroscience segmentation benchmarks, specifically targeting mitochondria, glia, and synapses, with significant accuracy improvements over state-of-the-art (SOTA) methods, including recent SAM-based adapters developed for the medical domain and other vision transformer-based approaches. Experimental results indicate that our approach outperforms existing solutions in the segmentation of complex processes like glia and post-synaptic densities. Our code and models are available at https://github.com/Uzshah/SAM4EM.
中文: SAM4EM提出了一种新方法,通过结合无提示适配器、双阶段微调和3D记忆注意力机制来增强分割一切模型,实现了电子显微镜数据中神经结构的3D分割,在复杂基准测试中表现优异,并发布了专用数据集。
English: SAM4EM introduces a novel method for 3D segmentation of neural structures in electron microscopy data by enhancing the Segment Anything Model with a prompt-free adapter, dual-stage fine-tuning, and 3D memory attention, achieving superior accuracy on challenging benchmarks and releasing a specialized dataset.
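
Low-Rank Adaptation (LoRA), which the abstract uses for fine-tuning, keeps a pretrained weight matrix W frozen and learns a low-rank update B·A on top of it. The sketch below shows the forward pass of a single adapted linear layer with plain lists; it illustrates generic LoRA, not SAM4EM's adapter, and the helper names are chosen for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x) for one linear layer.

    W: frozen weights, shape d_out x d_in.
    A: trainable down-projection, shape r x d_in.
    B: trainable up-projection, shape d_out x r (zero-initialized in LoRA).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output, and only the r × (d_in + d_out) entries of A and B need gradients.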

Authors:Mengting Wei, Yante Li, Tuomas Varanka, Yan Jiang, Guoying Zhao
Title: MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
Abstract:
In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.
中文: 本研究提出一种视频人脸重演方法,将FLAME三维人脸模型融入隐式扩散框架,通过几何引导和特征融合增强形状一致性与运动控制,实现了高质量、精准的面部动画生成。
English: This study introduces a video face reenactment method that integrates the FLAME 3D face model into a latent diffusion framework, enhancing shape consistency and motion control through geometric guidance and feature fusion for high-quality, precise facial animations.

Authors:Qinfeng Zhu, Yunxi Jiang, Lei Fan
Title: ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery
Abstract:
We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.
中文:ClassWise-CRF架构通过类别自适应加权融合专家网络预测并结合CRF优化,在遥感图像语义分割中显著提升了mIoU指标。
English: The ClassWise-CRF architecture enhances semantic segmentation by adaptively fusing expert networks' predictions using category-specific weighting and CRF optimization, improving mIoU by 0.68% to 1.00% on the LoveDA and Vaihingen remote sensing datasets.
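Sketch: The category-specific fusion step described in the abstract could look roughly like the following for a single pixel. This is a minimal illustration, not the authors' implementation: the function name `classwise_fuse`, the sharpness parameter `alpha`, and the final renormalisation are assumptions, and the CRF refinement with unary and pairwise potentials is omitted.

```python
import math


def classwise_fuse(probs, val_iou, alpha=1.0):
    """Fuse per-class confidences from several networks for one pixel.

    probs[k][c]   : confidence of network k for class c
    val_iou[k][c] : validation IoU of network k on class c (used as a prior)
    alpha         : hypothetical sharpness of the exponential weighting
    """
    num_nets = len(probs)
    num_classes = len(probs[0])
    fused = []
    for c in range(num_classes):
        # Exponential weights from per-class IoU, normalised over networks.
        raw = [math.exp(alpha * val_iou[k][c]) for k in range(num_nets)]
        total = sum(raw)
        weights = [r / total for r in raw]
        fused.append(sum(weights[k] * probs[k][c] for k in range(num_nets)))
    # Renormalise so the fused scores form a distribution over classes.
    norm = sum(fused)
    return [f / norm for f in fused]


# Two networks, two classes: network 0 is the expert on class 0,
# network 1 is the expert on class 1 (toy numbers).
probs = [[0.6, 0.4], [0.3, 0.7]]
val_iou = [[0.9, 0.2], [0.2, 0.9]]
fused = classwise_fuse(probs, val_iou, alpha=5.0)
uniform = classwise_fuse(probs, val_iou, alpha=0.0)
```

With `alpha=0.0` every network gets the same weight and the result reduces to a plain average; larger values let the per-class expert dominate its own category.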

Authors:Hebaixu Wang, Jing Zhang, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Title: DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
Abstract:
Diffusion models have achieved remarkable progress in universal image restoration. While existing methods speed up inference by reducing sampling steps, substantial step intervals often introduce cumulative errors. Moreover, they struggle to balance the commonality of degradation representations and restoration quality. To address these challenges, we introduce \textbf{DGSolver}, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models and tailor high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments show that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively. Code and models will be available at https://github.com/MiliLab/DGSolver.
中文: DGSolver 是一种扩散通用求解器,通过结合高阶求解器与加速采样策略及通用后验采样,有效提升了图像修复的精度、稳定性和扩展性,显著优于现有技术。
English: DGSolver is a diffusion generalist solver that enhances image restoration by employing high-order solvers with accelerated sampling and universal posterior sampling, achieving superior accuracy, stability, and scalability over existing methods.

Authors:Jingjing Liu, Nian Wu, Xianchao Xiu, Jianhua Zhang
Title: Robust Orthogonal NMF with Label Propagation for Image Clustering
Abstract:
Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called robust orthogonal nonnegative matrix factorization (RONMF). This method not only considers the graph Laplacian and label propagation as regularization terms but also introduces a more effective non-convex structure to measure the reconstruction error and imposes orthogonal constraints on the basis matrix to reduce the noise corruption, thereby achieving higher robustness. To solve RONMF, we develop an alternating direction method of multipliers (ADMM)-based optimization algorithm. In particular, all subproblems have closed-form solutions, which ensures its efficiency. Experimental evaluations on eight public image datasets demonstrate that the proposed RONMF outperforms state-of-the-art NMF methods across various standard metrics and shows excellent robustness. The code will be available at https://github.com/slinda-liu.
中文: 本文提出了一种鲁棒正交非负矩阵分解(RONMF)方法,通过结合标签传播和非凸优化来提高聚类精度和抗噪性,在八个图像数据集上的实验验证了其优越性能。
English: This paper introduces a robust orthogonal nonnegative matrix factorization (RONMF) method that enhances clustering accuracy and noise resistance by incorporating label propagation and non-convex optimization, validated through superior performance on eight image datasets.

Authors:Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, Fei Richard Yu
Title: RWKV-X: A Linear Complexity Hybrid Language Model
Abstract:
In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.
中文: RWKV-X是一种混合架构,将RWKV的短程建模效率与稀疏注意力机制相结合以捕捉长程上下文,在训练中实现线性时间复杂度和推理中恒定时间复杂度,在长上下文基准测试中超越先前模型,同时保持优异的短上下文性能。
English: RWKV-X is a hybrid architecture combining RWKV's efficiency for short-range modeling with sparse attention for long-range context, achieving linear-time training and constant-time inference while outperforming previous models on long-context benchmarks and maintaining strong short-context performance.

Authors:Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Abstract:
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.
中文: 本文提出了SeriesBench这一评估多模态大语言模型对叙事驱动视频系列理解能力的新基准,并开发了PC-DCoT推理框架,该框架能有效提升模型在分析复杂剧情结构和角色关系方面的性能表现。
English: This paper introduces SeriesBench, a novel benchmark for evaluating Multi-modal Large Language Models' understanding of narrative-driven video series, and proposes PC-DCoT, a reasoning framework that enhances models' performance in analyzing complex plot structures and character relationships.

Authors:Weicai Yan, Wang Lin, Zirun Guo, Ye Wang, Fangming Feng, Xiaoda Yang, Zehan Wang, Tao Jin
Title: Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision
Abstract:
Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diffusion-Driven Prompt Generator (Diff-Prompt), aiming to use the diffusion model to generate rich and fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved Diffusion Transformer (DiT) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available at https://github.com/Kelvin-ywc/diff-prompt.
中文: 提示学习在微调多模态模型中展现出潜力,但在复杂任务中因提示信息不够丰富而受限,为此提出的Diff-Prompt方法利用扩散模型生成精细提示,在实验中取得了显著性能提升。
English: Prompt learning is effective for fine-tuning pre-trained multimodal models, but its gains are limited on complex, fine-grained tasks because directly optimized prompts lack richness and specificity. Diff-Prompt instead uses a diffusion model with mask supervision to generate fine-grained prompts, improving R@1 by up to 8.87 and R@5 by up to 14.05 over the foundation model on referring expression comprehension.

Authors:Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui
Title: Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
Abstract:
Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.
中文:Galvatron是一个开源分布式系统,能自动选择最优混合并行策略来高效训练大规模基础模型,提供更高吞吐量和友好易用的界面。
English: Galvatron is an open-source distributed system that automatically determines the optimal hybrid parallelism strategy for training large-scale Foundation Models, delivering higher throughput and user-friendly accessibility.

Authors:Yumeng Shi, Quanyu Long, Wenya Wang
Title: Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Abstract:
Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To alleviate this, existing works propose to compress video inputs, but often overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. We propose a novel token selection strategy, \textsc{explore-then-select}, that adaptively adjusts static and dynamic information based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. Then it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our framework is plug-and-play and can be seamlessly integrated within diverse video language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8\%) on multiple video question answering benchmarks. Our code is available at https://github.com/ANDgate99/Explore-Then-Select.
中文: 我们提出的“探索后选择”框架根据查询需求自适应平衡静态与动态视频标记,通过即插即用集成在视频问答基准上实现最高5.8%的性能提升。
English: Our proposed "explore-then-select" framework adaptively balances static and dynamic video tokens based on query requirements, achieving up to 5.8% performance gains on video QA benchmarks through plug-and-play integration.
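Sketch: The two phases named in the abstract (explore token allocations, then select one with a query-aware metric) can be illustrated with a toy search over a fixed token budget. Everything here is hypothetical: `candidate_allocations`, `explore_then_select`, and in particular `toy_score`, which merely stands in for the paper's attention-based metric.

```python
def candidate_allocations(budget, step):
    """Enumerate (key_frame_tokens, delta_frame_tokens) splits of a budget."""
    return [(k, budget - k) for k in range(0, budget + 1, step)]


def explore_then_select(budget, step, score_fn):
    """Return the allocation with the highest query-aware score."""
    return max(candidate_allocations(budget, step), key=score_fn)


def toy_score(alloc):
    # Hypothetical stand-in for the attention-based metric: this query is
    # assumed to depend mostly on temporal change (dynamic information).
    key_tokens, delta_tokens = alloc
    static_need, dynamic_need = 0.25, 0.75
    return static_need * key_tokens ** 0.5 + dynamic_need * delta_tokens ** 0.5


candidates = candidate_allocations(100, 25)
best = explore_then_select(100, 25, toy_score)
```

No model weights are updated in this loop; only the split of the budget changes per query, which matches the plug-and-play framing in the abstract.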

Authors:Hong Zhang, Zhongjie Duan, Xingjun Wang, Yuze Zhao, Weiyi Lu, Zhipeng Di, Yixuan Xu, Yingda Chen, Yu Zhang
Title: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Abstract:
Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge for the autoregressive and diffusion models, which seamlessly integrates their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage and multi-task training on our constructed large-scale dataset with 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks. All models, datasets, and source codes are released in https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.
中文: Nexus-Gen提出了一种在共享图像嵌入空间中融合自回归与扩散模型的统一架构,通过创新的预填充自回归策略和多任务训练,在图像理解、生成和编辑任务中实现了最先进的性能。
English: Nexus-Gen introduces a unified architecture combining autoregressive and diffusion models in a shared image embedding space, achieving state-of-the-art performance in image understanding, generation, and editing through innovative prefilled autoregression and multi-task training.

Authors:Luoting Zhuang, Seyed Mohammad Hossein Tabatabaei, Ramin Salehi-Rad, Linh M. Tran, Denise R. Aberle, Ashley E. Prosper, William Hsu
Title: Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Abstract:
Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,246 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.812), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
中文: 本研究整合放射科医生的语义特征与微调的CLIP模型预测肺癌,在多个数据集中实现卓越性能并提供可解释结果。
English: This research integrates radiologists' semantic features with a fine-tuned CLIP model to predict lung cancer, achieving superior performance and explainable results across multiple datasets.

Authors:Khoa Tuan Nguyen, Ho-min Park, Gaeun Oh, Joris Vankerschaver, Wesley De Neve
Title: Towards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability
Abstract:
We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at https://github.com/Khoa-NT/isbi2025_ps3c.
中文: 我们提出了一种基于EVA-02转换器模型的宫颈细胞图像分类新方法,通过四步流程实现了0.85227的优异F1分数,并利用SHAP分析提供了可解释的决策依据。
English: We introduce a cervical cell classification method built on the EVA-02 transformer model; its four-step pipeline achieves an F1-score of 0.85227, compared with 0.84878 for the baseline EVA-02, and Kernel SHAP analysis provides interpretable insights into the model's decisions.

Authors:Zhelun Shen, Zhuo Li, Chenming Wu, Zhibo Rao, Lina Liu, Yuchao Dai, Liangjun Zhang
Title: CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching
Abstract:
Recently, learning-based stereo matching methods have achieved great improvement in public benchmarks, where soft argmin and smooth L1 loss are central to their success. However, in unsupervised domain adaptation scenarios, we observe that these two operations often yield multimodal disparity probability distributions in target domains, resulting in degraded generalization. In this paper, we propose a novel approach, Constrain Multi-modal Distribution (CMD), to address this issue. Specifically, we introduce \textit{uncertainty-regularized minimization} and \textit{anisotropic soft argmin} to encourage the network to produce predominantly unimodal disparity distributions in the target domain, thereby improving prediction accuracy. Experimentally, we apply the proposed method to multiple representative stereo-matching networks and conduct domain adaptation from synthetic data to unlabeled real-world scenes. Results consistently demonstrate improved generalization in both top-performing and domain-adaptable stereo-matching models. The code for CMD will be available at: https://github.com/gallenszl/CMD.
中文: 本文提出约束多模态分布(CMD)方法,通过不确定性正则化最小化和各向异性软argmin操作,在立体匹配的无监督域适应中促进单模态视差分布,从而提升模型在真实场景中的泛化性能。
English: This paper introduces the Constrain Multi-modal Distribution (CMD) method to enhance unsupervised domain adaptation in stereo matching by promoting unimodal disparity distributions through uncertainty-regularized minimization and anisotropic soft argmin, improving generalization in real-world scenarios.
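Sketch: The failure mode the abstract describes can be reproduced with the standard soft argmin, which returns the expected disparity under a softmax over negative matching costs. When the distribution has two modes, the expectation lands between them. The `temperature` parameter below is an assumption used only to show the effect of a sharper, more unimodal distribution; it is not the paper's anisotropic soft argmin, whose exact form is not given in the abstract.

```python
import math


def soft_argmin(costs, temperature=1.0):
    """Expected disparity under softmax(-cost / temperature)."""
    logits = [-c / temperature for c in costs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return sum(d * e / total for d, e in enumerate(exps))


# Toy cost volume for one pixel with two near-equal minima,
# at disparity 2 (cost 0.0) and disparity 8 (cost 0.2).
costs = [5.0, 5.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.2, 5.0]

blurred = soft_argmin(costs, temperature=1.0)   # falls between the two modes
sharp = soft_argmin(costs, temperature=0.05)    # close to the best mode
```

The bimodal case yields a disparity that matches neither candidate surface, which is why pushing the distribution toward a single mode can improve accuracy.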

Authors:Jinpeng Wang, Tianci Luo, Yaohua Zha, Yan Feng, Ruisheng Luo, Bin Chen, Tao Dai, Long Chen, Yaowei Wang, Shu-Tao Xia
Title: Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning
Abstract:
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-art methods across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at https://github.com/gimpong/CVPR25-Condenser.
中文: 本研究提出Condenser方法,通过协同压缩多个提示实现视觉上下文学习优化,在基准测试中展现出优于现有技术的效率与性能表现。
English: The study introduces Condenser, a prompt condensation method that collaboratively compresses multiple prompts to enhance visual in-context learning, outperforming existing techniques in efficiency and performance across benchmarks.

Authors:Sixuan Wang, Jiao Yin, Jinli Cao, MingJian Tang, Hua Wang, Yanchun Zhang
Title: ABG-NAS: Adaptive Bayesian Genetic Neural Architecture Search for Graph Representation Learning
Abstract:
Effective and efficient graph representation learning is essential for enabling critical downstream tasks, such as node classification, link prediction, and subgraph search. However, existing graph neural network (GNN) architectures often struggle to adapt to diverse and complex graph structures, limiting their ability to produce structure-aware and task-discriminative representations. To address this challenge, we propose ABG-NAS, a novel framework for automated graph neural network architecture search tailored for efficient graph representation learning. ABG-NAS encompasses three key components: a Comprehensive Architecture Search Space (CASS), an Adaptive Genetic Optimization Strategy (AGOS), and a Bayesian-Guided Tuning Module (BGTM). CASS systematically explores diverse propagation (P) and transformation (T) operations, enabling the discovery of GNN architectures capable of capturing intricate graph characteristics. AGOS dynamically balances exploration and exploitation, ensuring search efficiency and preserving solution diversity. BGTM further optimizes hyperparameters periodically, enhancing the scalability and robustness of the resulting architectures. Empirical evaluations on benchmark datasets (Cora, PubMed, Citeseer, and CoraFull) demonstrate that ABG-NAS consistently outperforms both manually designed GNNs and state-of-the-art neural architecture search (NAS) methods. These results highlight the potential of ABG-NAS to advance graph representation learning by providing scalable and adaptive solutions for diverse graph structures. Our code is publicly available at https://github.com/sserranw/ABG-NAS.
中文: ABG-NAS框架通过自适应遗传优化和贝叶斯调优自动搜索图神经网络最优架构,在基准数据集上持续超越现有方法,为多样化图结构提供可扩展的解决方案。
English: ABG-NAS is an automated framework that efficiently searches for optimal graph neural network architectures through adaptive genetic optimization and Bayesian tuning, consistently outperforming existing methods on benchmark datasets.

Authors:Xuanzhao Dong, Wenhui Zhu, Hao Wang, Xiwen Chen, Peijie Qiu, Rui Yin, Yi Su, Yalin Wang
Title: Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA
Abstract:
Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.
中文: Discuss-RAG通过引入基于智能体的协作推理机制来提升医学问答系统的检索相关性和答案准确性,在多个基准数据集上显著超越了现有方法的性能表现。
English: Discuss-RAG is a plug-and-play module that enhances medical QA RAG systems with collaborative agent-based reasoning, improving retrieval relevance and raising answer accuracy over MedRAG by up to 16.67% on BioASQ and 12.20% on PubMedQA.

Authors:Alexander L. Mitchell, Tobit Flatscher, Ingmar Posner
Title: Task and Joint Space Dual-Arm Compliant Control
Abstract:
Robots that interact with humans or perform delicate manipulation tasks must exhibit compliance. However, most commercial manipulators are rigid and suffer from significant friction, limiting end-effector tracking accuracy in torque-controlled modes. To address this, we present a real-time, open-source impedance controller that smoothly interpolates between joint-space and task-space compliance. This hybrid approach ensures safe interaction and precise task execution, such as sub-centimetre pin insertions. We deploy our controller on Frank, a dual-arm platform with two Kinova Gen3 arms, and compensate for modelled friction dynamics using a model-free observer. The system is real-time capable and integrates with standard ROS tools like MoveIt!. It also supports high-frequency trajectory streaming, enabling closed-loop execution of trajectories generated by learning-based methods, optimal control, or teleoperation. Our results demonstrate robust tracking and compliant behaviour even under high-friction conditions. The complete system is available open-source at https://github.com/applied-ai-lab/compliant_controllers.
中文: 本研究提出了一种实时开源的混合阻抗控制器,通过在关节空间与任务空间的柔顺性之间平滑插值,使机器人既能实现安全交互又能完成精密操作,并在双机械臂平台上验证了高摩擦条件下的鲁棒性能。
English: This study introduces a real-time, open-source hybrid impedance controller that enables robots to achieve both safe interaction and precise task execution by smoothly interpolating between joint-space and task-space compliance, with validation on a dual-arm platform demonstrating robust tracking under high friction.

Authors:Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang
Title: Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization
Abstract:
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data while preserving privacy. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. However, such a one-prompt-fits-all learning paradigm typically leads to performance degradation on personalized samples. Although the mixture of experts (MoE) offers a promising solution for specialization, existing MoE-based methods suffer from coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG, which treats multiple prompts as distinct experts. Unlike existing image-level routing designs, TRIP assigns different tokens within an image to specific experts. To ensure communication efficiency, TRIP incorporates a parameter-free routing mechanism based on token clustering and optimal transport. The instance-specific prompt is then synthesized by aggregating experts, weighted by the number of tokens assigned to each. Additionally, TRIP develops an unbiased learning strategy for prompt experts, leveraging the VLM's zero-shot generalization capability. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communication of only 1K parameters per round. Our code is available at https://github.com/GongShuai8210/TRIP.
中文: TRIP提出了一种用于联邦领域泛化的无参数路由令牌级提示混合框架,通过将图像令牌分配给专业专家,实现了高效通信和增强的模型泛化能力。
English: TRIP introduces a token-level prompt mixture framework with parameter-free routing for federated domain generalization, enabling efficient communication and enhanced model generalization by assigning image tokens to specialized experts.
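Sketch: The abstract states that the instance-specific prompt is synthesized by aggregating experts, weighted by the number of tokens assigned to each. A minimal illustration of that aggregation step follows. The token-to-expert assignment (token clustering plus optimal transport in the paper) is taken as given here, and the names `synthesize_prompt`, `token_to_expert`, and `expert_prompts` are hypothetical.

```python
from collections import Counter


def synthesize_prompt(token_to_expert, expert_prompts):
    """Aggregate expert prompts, weighted by tokens routed to each expert.

    token_to_expert : list with one expert index per image token
    expert_prompts  : list of prompt vectors, one per expert
    """
    counts = Counter(token_to_expert)
    total = len(token_to_expert)
    dim = len(expert_prompts[0])
    prompt = [0.0] * dim
    for expert, n in counts.items():
        weight = n / total  # fraction of tokens routed to this expert
        for i in range(dim):
            prompt[i] += weight * expert_prompts[expert][i]
    return prompt


# Four image tokens: three routed to expert 0, one to expert 1 (toy values).
assignments = [0, 0, 1, 0]
expert_prompts = [[1.0, 0.0], [0.0, 1.0]]
prompt = synthesize_prompt(assignments, expert_prompts)
```

Because the weights come from counting assignments rather than from a learned router, this step adds no parameters to communicate, which is consistent with the parameter-free routing described above.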

Authors:Yu Zheng, Longyi Liu, Yuming Lin, Jie Feng, Guozhen Zhang, Depeng Jin, Yong Li
Title: UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
Abstract:
The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
中文: 本文提出UrbanPlanBench基准测试,揭示大语言模型在城市规划知识方面存在显著不足(尤其对法规理解薄弱),并发布UrbanPlanText微调数据集——虽能提升模型表现,但在专业术语与推理方面仍需大幅改进。
English: This paper introduces UrbanPlanBench, a benchmark that reveals large language models' significant limitations in urban planning knowledge, particularly in regulatory understanding, and presents UrbanPlanText—a fine-tuning dataset that improves model performance while highlighting ongoing challenges in domain-specific reasoning.

Authors:Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, Dong Yu
Title: WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model
Abstract:
Agent self-improvement, where the backbone Large Language Model (LLM) of the agent is trained on trajectories sampled autonomously based on its own policy, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre-trained web knowledge in LLMs. To improve the performance of self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful closed-source models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability. Code is available at https://github.com/Tencent/SelfEvolvingAgent
中文: 该框架引入协同进化的世界模型大语言模型,既作为虚拟网络服务器生成自指导训练数据,又在推理时充当想象引擎,通过突破探索限制并利用预训练知识,在网络环境中实现了10%的性能提升。
English: The proposed framework introduces a co-evolving World Model LLM that acts as both a virtual web server for generating self-instructed training data and an imagination engine during inference, achieving a 10% performance gain in web environments by overcoming exploration limitations and leveraging pre-trained knowledge.

Authors:Yinghan Zhou, Juan Wen, Wanli Peng, Yiming Xue, Ziwei Zhang, Zhengxian Wu
Title: Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations
Abstract:
The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. A unified mechanism that simultaneously addresses the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. Then, we propose a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by reinforcement learning with an elaborated reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.
中文: 本文提出了一种名为DP-Net的新型AI生成文本检测方法,通过强化学习引入动态扰动,在跨域场景中展现出卓越的泛化能力,并在对抗攻击下实现了最优的鲁棒性表现。
English: This paper introduces DP-Net, a novel AI-generated text detection method that employs dynamic perturbations via reinforcement learning, demonstrating superior generalization across domains and enhanced robustness against adversarial attacks compared to existing approaches.

Authors:Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Title: Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Abstract:
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention
中文: Softpick 是 Transformer 注意力机制中 softmax 的改进替代方案,它消除了注意力汇聚现象并显著降低激活峰度,在量化模型中表现更优,为多种优化技术开辟了新途径。
English: Softpick is a rectified, non-sum-to-one alternative to softmax in transformer attention that eliminates attention sinks and reduces activation kurtosis, improving performance in quantized models and enabling new optimization possibilities.
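
To make "rectified, not sum-to-one" concrete, here is a minimal sketch. The abstract does not state the formula, so the form used below, ReLU(e^x - 1) divided by the sum of |e^x - 1|, is an assumption to be checked against the paper and the linked repository. The `eps` term and the naive, non-stabilised exponentials are also illustrative choices.

```python
import math

def softmax(scores):
    """Standard softmax: outputs are strictly positive and sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def softpick(scores, eps=1e-6):
    """Assumed form: ReLU(e^x - 1) / (sum_j |e^x_j - 1| + eps)."""
    shifted = [math.exp(s) - 1.0 for s in scores]   # naive; overflows for large scores
    denom = sum(abs(v) for v in shifted) + eps
    return [max(v, 0.0) / denom for v in shifted]

scores = [2.0, 0.0, -1.0]
soft = softmax(scores)
pick = softpick(scores)
```

Under this assumed form, any score at or below zero receives exactly zero weight, and the weights need not sum to one. That means a head is not forced to put attention mass somewhere, which is the behaviour the abstract associates with removing attention sinks.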

Authors:Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang
Title: AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
Abstract:
We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents (orchestrator, deflector, responder, and evaluator) collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time, both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy), substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks, without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. For jailbreaking benchmarks, we achieve 51% improvement compared to the base model on StrongReject, with false refusal rates of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at https://github.com/zikuicai/aegisllm
中文: AegisLLM是一种多智能体防御系统,通过协同工作的智能体角色和自动提示优化来增强大语言模型的安全性,在无需重新训练模型的情况下,在遗忘和越狱基准测试中取得了显著改进。
English: AegisLLM is a multi-agent defense system that enhances LLM security through collaborative agent roles and automated prompt optimization, achieving significant improvements in unlearning and jailbreaking benchmarks without requiring model retraining.

Authors:Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen
Title: OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Abstract:
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each of which is a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs exhibits the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.
中文: OSVBench是一个基于Hyperkernel操作系统内核的新基准,包含245个复杂任务,用于评估大语言模型在生成操作系统内核验证规范代码方面的能力,结果显示当前模型在处理长上下文代码生成任务上表现有限。
English: OSVBench is a new benchmark for evaluating LLMs in generating complete specification code for operating system kernel verification, built upon the Hyperkernel with 245 complex tasks, revealing current models' limited performance in long-context code generation.

Authors:Quentin Guimard, Moreno D'Incà, Massimiliano Mancini, Elisa Ricci
Title: Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
Abstract:
A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect: this greatly limits the number of tasks where model biases can be identified. In this work, we present Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting biases together with task-specific target labels. A retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.
中文:C2B框架无需标注数据即可检测预训练模型中的偏见,它通过任务描述生成偏见建议并评估模型准确性,其性能优于依赖标注的方法,推动了无监督偏见检测的发展。
English: The C2B framework enables bias detection in pre-trained models without labeled data by using task descriptions to generate bias proposals and assess model accuracy, outperforming annotation-dependent methods and advancing unsupervised bias discovery.

Authors:Harry Mead, Clarissa Costen, Bruno Lacerda, Nick Hawes
Title: Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
Abstract:
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
中文: 该方法通过限制轨迹回报而非直接丢弃来优化CVaR,在多种环境中均展现出比基线方法更优的性能表现。
English: The proposed method improves sample efficiency in CVaR optimization by capping trajectory returns instead of discarding them, demonstrating consistent performance gains across multiple environments.
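
The claim that capping is equivalent to the original problem "if the cap is set appropriately" can be illustrated numerically. The sketch below is not the authors' algorithm. It uses the standard Rockafellar-Uryasev identity for lower-tail CVaR and assumes the cap sits at the alpha-quantile of returns; the function names and sample returns are hypothetical.

```python
def cvar(returns, alpha):
    """Lower-tail CVaR: mean of the worst alpha-fraction of returns."""
    k = max(1, round(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k

def capped_objective(returns, cap, alpha):
    """cap + E[min(R, cap) - cap] / alpha, computed over ALL trajectories."""
    mean_shortfall = sum(min(r, cap) - cap for r in returns) / len(returns)
    return cap + mean_shortfall / alpha

# Hypothetical episode returns.
returns = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
alpha = 0.2
cap = 2.0   # assumed to be the alpha-quantile (value at risk) of the returns

tail_mean = cvar(returns, alpha)
capped = capped_objective(returns, cap, alpha)
```

In this toy example both quantities are 1.5. The first discards eight of ten trajectories, while the second keeps every trajectory and only clips its return, which is the sample-efficiency argument the abstract makes.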

Authors:Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli
Title: AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Abstract:
The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
中文: Ai-GenBench 是一种新颖的时序基准,旨在通过评估检测方法对新型生成模型(如从GAN到扩散模型的过渡)的泛化能力,来稳健检测AI生成图像,同时为实际应用提供标准化工具和数据集。
English: Ai-GenBench is a novel temporal benchmark designed to robustly detect AI-generated images by evaluating detection methods' generalization to new generative models, such as the transition from GANs to diffusion models, while providing standardized tools and datasets for practical use.

Authors:Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, Elisa Ricci
Title: FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
Abstract:
In federated learning, textual prompt tuning adapts Vision-Language Models (e.g., CLIP) by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. After training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning suffers from overfitting to known concepts, limiting its generalizability to unseen concepts. To address this limitation, we propose Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on multimodal contextual information - derived from the input image and textual attribute features of a class. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through a cross-attention mechanism. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets, spanning three generalization settings, demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains, surpassing state-of-the-art methods by a notable margin of +1.57% - 2.26%. Code is available at https://github.com/mainaksingha01/FedMVP.
中文: 提出的FedMVP方法通过结合图像和文本特征生成多模态视觉提示,在保持已知概念性能的同时,显著提升了联邦学习对未知概念的泛化能力。
English: The proposed FedMVP method enhances federated learning by generating multimodal visual prompts using both image and text features, significantly improving generalization to unseen concepts while maintaining performance on known ones.

Authors:Haitao Wu, Zongbo Han, Joey Tianyi Zhou, Huaxi Huang, Changqing Zhang
Title: Computational Reasoning of Large Language Models
Abstract:
With the rapid development and widespread application of Large Language Models (LLMs), multidimensional evaluation has become increasingly critical. However, current evaluations are often domain-specific and overly complex, limiting their effectiveness as cross-domain proxies for core capabilities. To address these limitations and enable a unified and simple evaluation framework, an ideal proxy task should target a basic capability that generalizes across tasks and is independent of domain-specific knowledge. The Turing machine provides a powerful theoretical lens by reducing complex processes to basic, domain-agnostic computational operations. This perspective offers a principled framework for evaluating basic computational abilities essential to a wide range of tasks. Motivated by this abstraction, we introduce Turing Machine Bench (TMBench), a benchmark designed to assess the ability of LLMs to strictly follow rules and accurately manage internal states across multi-step processes, referred to as computational reasoning. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a solid theoretical foundation based on the Turing machine. Empirical results demonstrate that TMBench serves as an effective proxy for evaluating computational reasoning on representative LLMs. It produces clear step-wise accuracy curves, revealing LLMs' ability to execute multi-step reasoning processes. By analyzing performance trends across TMBench and established reasoning benchmarks, we find strong correlations with real-world tasks, bridging real-task evaluation with basic ability assessment. These findings suggest that TMBench holds potential as a cross-domain dimension for evaluating reasoning in LLMs. Code and data are available at https://github.com/HaitaoWuTJU/Turing-Machine-Bench.
中文: 该摘要介绍了基于图灵机原理的TMBench基准测试,通过评估大语言模型在多步骤过程中严格遵循规则和管理内部状态的能力来测试其计算推理水平,并显示出与实际任务的强相关性。
English: The abstract introduces TMBench, a benchmark based on Turing machine principles to evaluate LLMs' computational reasoning by testing their ability to strictly follow rules and manage internal states across multi-step processes, showing strong correlations with real-world tasks.
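
What "strictly follow rules and accurately manage internal states" means at each step can be seen in a tiny Turing machine simulator. This is a generic sketch, not TMBench's task format (which the abstract does not specify). The rule table, state names, and tape encoding are hypothetical.

```python
def run_turing_machine(tape, rules, state="q0", head=0, blank="_", max_steps=100):
    """Simulate a one-tape machine; rules map (state, symbol) -> (write, move, next_state)."""
    cells = dict(enumerate(tape))
    trace = []                                   # (state, head) after every step
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
        trace.append((state, head))
    lo, hi = min(cells), max(cells)
    final = "".join(cells.get(i, blank) for i in range(lo, hi + 1))
    return final, trace

# Hypothetical machine: flip every bit, then halt on the first blank cell.
rules = {
    ("q0", "0"): ("1", "R", "q0"),
    ("q0", "1"): ("0", "R", "q0"),
    ("q0", "_"): ("_", "R", "halt"),
}
final_tape, trace = run_turing_machine("1011", rules)
```

A step-wise accuracy curve of the kind the abstract mentions could be obtained by comparing a model's predicted configuration against `trace` at each step. That evaluation loop is left out here.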

Authors:Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
Title: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Abstract:
Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
中文: 本研究通过分析大语言模型的中间推理步骤,质疑最终答案的可靠性,并发现聚合分段子思维的答案能显著提高不同模型和数据集上的准确性。
English: This study questions the reliability of final answers from Large Language Models by analyzing intermediate reasoning steps and finds that aggregating answers from segmented subthoughts significantly improves accuracy across various models and datasets.
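
The aggregation step (take the most frequent answer across continuations) is simple enough to sketch directly. The segmentation of the trace and the model calls that produce the continuations are assumed to have happened already. The function name, the sample answers, and the agreement score are illustrative, not taken from the linked implementation.

```python
from collections import Counter

def aggregate_by_mode(answers):
    """Return the most frequent extracted answer and its share of valid answers."""
    counts = Counter(a for a in answers if a is not None)   # skip failed extractions
    if not counts:
        return None, 0.0
    answer, freq = counts.most_common(1)[0]
    return answer, freq / sum(counts.values())

# Hypothetical answers extracted from continuations started at each subthought.
continuation_answers = ["42", "42", "17", "42", None, "17"]
final_answer, agreement = aggregate_by_mode(continuation_answers)
```

The agreement share is one possible consistency signal of the kind the abstract says correlates with confidence and correctness. A low value would flag an answer as less reliable.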

Authors:Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang
Title: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Abstract:
Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) An in-context editing paradigm without architectural modifications; (2) Minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% trainable parameters compared to previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Code and demos can be found at https://river-zhang.github.io/ICEdit-gh-pages/.
中文: ICEdit提出了一种基于扩散变换器的高效指令图像编辑方法,通过上下文编辑、参数高效微调和噪声样本优化三大创新,仅用极少训练数据和参数即实现了最优编辑性能。
English: ICEdit introduces an efficient instruction-based image editing method using Diffusion Transformers, achieving superior performance with minimal training data and parameters through in-context editing, parameter-efficient fine-tuning, and noise sample optimization.

Authors:Long Liu, Cihui Yang
Title: OG-HFYOLO: Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation
Abstract:
Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response by Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale objective features, and introduces mask-driven non-maximal suppression in the post-processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine-grained deformation table cell spatial coordinate localization, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy on all mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.
中文: 针对表格结构识别中的几何变形问题,本文提出了OG-HFYOLO模型以增强边缘响应和多尺度特征适应,并创建了DWTAL数据集来支持细粒度单元格空间坐标定位。
English: The OG-HFYOLO model is proposed to address geometric deformation in table structure recognition by enhancing edge detection and multi-scale feature adaptation, while a new dataset DWTAL is created to support fine-grained spatial coordinate localization.

Authors:Adam Gudyś, Cezary Maszczyk, Joanna Badura, Adam Grzelak, Marek Sikora, Łukasz Wróbel
Title: RuleKit 2: Faster and simpler rule learning
Abstract:
Rules offer an invaluable combination of predictive and descriptive capabilities. Our package for rule-based data analysis, RuleKit, has proven its effectiveness in classification, regression, and survival problems. Here we present its second version. New algorithms and optimized implementations of those previously included have significantly improved the computational performance of our suite, reducing the analysis time of some data sets by two orders of magnitude. The usability of RuleKit 2 is enhanced by two new components: a Python package and a browser application with a graphical user interface. The former complies with scikit-learn, the most popular data mining library for Python, allowing RuleKit 2 to be straightforwardly integrated into existing data analysis pipelines. RuleKit 2 is available on GitHub under the GNU AGPL 3 license (https://github.com/adaa-polsl/RuleKit).
中文: RuleKit 2 版本引入了增强算法和优化实现,显著提升了计算性能,并通过新增的Python包和浏览器图形界面,实现了与现有数据分析流程的无缝集成。
English: RuleKit 2 introduces enhanced algorithms and optimized implementations, drastically improving computational performance and adding Python and browser-based interfaces for seamless integration into data analysis workflows.

Authors:Andrew Fitzgibbon, Stephen Felix
Title: On Stochastic Rounding with Few Random Bits
Abstract:
Large-scale numerical computations make increasing use of low-precision (LP) floating point formats and mixed precision arithmetic, which can be enhanced by the technique of stochastic rounding (SR), that is, rounding an intermediate high-precision value up or down randomly as a function of the value's distance to the two rounding candidates. Stochastic rounding requires, in addition to the high-precision input value, a source of random bits. As the provision of high-quality random bits is an additional computational cost, it is of interest to require as few bits as possible while maintaining the desirable properties of SR in a given computation, or computational domain. This paper examines a number of possible implementations of few-bit stochastic rounding (FBSR), and shows how several natural implementations can introduce sometimes significant bias into the rounding process, which are not present in the case of infinite-bit, infinite-precision examinations of these implementations. The paper explores the impact of these biases in machine learning examples, and hence opens another class of configuration parameters of which practitioners should be aware when developing or adopting low-precision floating point. Code is available at http://github.com/graphcore-research/arith25-stochastic-rounding.
中文: 本文研究了几种少位随机舍入方法,发现某些实现方式会引入理想情况下不存在的显著偏差,并在机器学习应用中展示了这些偏差的影响。
English: This paper investigates few-bit stochastic rounding methods, revealing that certain implementations can introduce significant biases not present in ideal scenarios, and demonstrates their impact in machine learning applications.
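
One way bias can enter few-bit stochastic rounding is shown in the sketch below. It rounds to the nearest integers and truncates the fractional part to k bits before comparing it with a k-bit random integer. This truncating variant is an illustrative assumption and not necessarily one of the implementations the paper analyses. The expectation is computed exactly by enumerating all 2**k random values.

```python
import math

def few_bit_sr(x, r, k):
    """Round x to an integer using a k-bit random integer r in [0, 2**k)."""
    lower = math.floor(x)
    frac_bits = math.floor((x - lower) * 2**k)   # keep only the top k fraction bits
    return lower + 1 if r < frac_bits else lower

def expected_value(x, k):
    """Exact mean of the rounded result over all equally likely k-bit values of r."""
    outcomes = [few_bit_sr(x, r, k) for r in range(2**k)]
    return sum(outcomes) / len(outcomes)

x = 0.4
bias_2_bits = expected_value(x, 2) - x   # about -0.15: rounds down too often
bias_8_bits = expected_value(x, 8) - x   # much smaller, still slightly negative
```

Ideal stochastic rounding has an expectation equal to x, so any nonzero value here is bias from the k-bit truncation. Inputs whose fraction fits in k bits, such as 0.25 with k = 2, are rounded without bias by this variant.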

Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
Title: ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Abstract:
Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises these primary components: 1) Multimodal Pose Encoder, based on contrastive learning, considering the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts. 2) Immersive Drama Transformer, a flow-based mamba-transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.
中文: 本研究首次提出了沉浸式空间戏剧生成模型ISDrama,它通过多模态提示生成连续的戏剧性双耳语音,在客观和主观评估中均优于现有基线模型。
English: This research introduces the first immersive spatial drama generation model, ISDrama, which utilizes multimodal prompts to create continuous, dramatic binaural speech and outperforms existing baselines in both objective and subjective evaluations.

Authors:Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
Title: ReasonIR: Training Retrievers for Reasoning Tasks
Abstract:
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
中文:ReasonIR-8B是首个专为通用推理任务训练的检索器,通过创新的合成数据生成方法,在推理基准测试中取得最优性能,并显著提升了RAG任务的表现。
English: ReasonIR-8B is the first retriever specifically trained for general reasoning tasks, achieving state-of-the-art performance on reasoning benchmarks and significantly enhancing RAG task results through a novel synthetic data generation pipeline.
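
The ReasonIR abstract above reports its retrieval results as nDCG@10. As a reading aid, the metric can be sketched as follows. This is a minimal illustration, not the BRIGHT evaluation code: it assumes linear gain and assumes the ranked list already contains every judged document, so the ideal ordering can be obtained by sorting the same list. The function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumption: ranked_relevances holds the relevance grade of every judged
    document, in the order the retriever returned them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# A relevant document at rank 2 instead of rank 1 is discounted by 1 / log2(3).
score = ndcg_at_k([0, 1, 0])
```

Benchmark scores such as 29.9 are this quantity averaged over queries and multiplied by 100.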

Authors:Rilind Sahitaj, Paulius Sasnauskas, Yiğit Yalın, Debmalya Mandal, Goran Radanović
Title: Independent Learning in Performative Markov Potential Games
Abstract:
Performative Reinforcement Learning (PRL) refers to a scenario in which the deployed policy changes the reward and transition dynamics of the underlying environment. In this work, we study multi-agent PRL by incorporating performative effects into Markov Potential Games (MPGs). We introduce the notion of a performatively stable equilibrium (PSE) and show that it always exists under a reasonable sensitivity assumption. We then provide convergence results for state-of-the-art algorithms used to solve MPGs. Specifically, we show that independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) converge to an approximate PSE in the best-iterate sense, with an additional term that accounts for the performative effects. Furthermore, we show that INPG asymptotically converges to a PSE in the last-iterate sense. As the performative effects vanish, we recover the convergence rates from prior work. For a special case of our game, we provide finite-time last-iterate convergence results for a repeated retraining approach, in which agents independently optimize a surrogate objective. We conduct extensive experiments to validate our theoretical findings.
中文: 本研究针对马尔可夫势博弈中的多智能体表演性强化学习提出了表演性稳定均衡(PSE)概念,通过理论证明和实验验证了独立策略梯度方法能够收敛至近似PSE的结论。
English: This work introduces performatively stable equilibrium (PSE) for multi-agent performative reinforcement learning in Markov Potential Games, demonstrating that independent policy gradient methods converge to approximate PSE with theoretical guarantees validated by experiments.

Authors:Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Abstract:
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
Chinese: 单样本可验证奖励强化学习(RLVR)显著提升了大语言模型的数学推理能力,将MATH500基准测试准确率从36.0%提升至73.6%,并展现出跨领域泛化能力和训练饱和后的持续性能提升。
English: One-shot reinforcement learning with verifiable reward (RLVR) significantly enhances large language models' mathematical reasoning, boosting performance on benchmarks like MATH500 from 36.0% to 73.6% and demonstrating cross-domain generalization and post-saturation improvement.
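
The RLVR abstract above relies on two ingredients: a reward that can be checked automatically, and an RL algorithm such as GRPO. The sketch below illustrates both under stated assumptions: the reward is an exact string match on the final answer, and the advantage is the reward normalized by the mean and population standard deviation of its sampling group. The function names and the `eps` constant are hypothetical, and the authors' implementation may differ in all of these details.

```python
import statistics

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 when the final answer matches the reference exactly, else 0.0.

    Assumption: answers are compared as whitespace-stripped strings.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to the single training prompt, two of them correct.
rewards = [verifiable_reward(a, "42") for a in ["42", "41", "7", " 42 "]]
advantages = group_relative_advantages(rewards)
```

In this sketch, a group in which every response earns the same reward yields all-zero advantages.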

Authors:Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen
Title: Dynamic Attention Analysis for Backdoor Detection in Text-to-Image Diffusion Models
Abstract:
Recent studies have revealed that text-to-image diffusion models are vulnerable to backdoor attacks, where attackers implant stealthy textual triggers to manipulate model outputs. Previous backdoor detection methods primarily focus on the static features of backdoor samples. However, a vital property of diffusion models is their inherent dynamism. This study introduces a novel backdoor detection perspective named Dynamic Attention Analysis (DAA), showing that these dynamic characteristics serve as better indicators for backdoor detection. Specifically, by examining the dynamic evolution of cross-attention maps, we observe that backdoor samples exhibit distinct feature evolution patterns at the <EOS> token compared to benign samples. To quantify these dynamic anomalies, we first introduce DAA-I, which treats the tokens' attention maps as spatially independent and measures the dynamic feature using the Frobenius norm. Furthermore, to better capture the interactions between attention maps and refine the feature, we propose a dynamical system-based approach, referred to as DAA-S. This model formulates the spatial correlations among attention maps using a graph-based state equation, and we theoretically analyze the global asymptotic stability of this method. Extensive experiments across five representative backdoor attack scenarios demonstrate that our approach significantly surpasses existing detection methods, achieving an average F1 Score of 79.49% and an AUC of 87.67%. The code is available at https://github.com/Robin-WZQ/DAA.
中文: 本研究提出动态注意力分析(DAA)方法,通过捕捉交叉注意力图的动态演化模式来有效检测文本到图像扩散模型中的后门攻击,在多项实验中平均F1分数达79.49%,显著优于现有检测方法。
English: This study introduces Dynamic Attention Analysis (DAA), a novel detection method that leverages the dynamic evolution patterns in cross-attention maps to effectively identify backdoor attacks in text-to-image diffusion models, significantly outperforming existing approaches with an average F1 Score of 79.49%.
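
The DAA abstract above states that DAA-I measures a dynamic feature of attention maps with the Frobenius norm. One plausible reading is sketched below: take the cross-attention map of a single token at successive denoising steps and compute the Frobenius norm of each step-to-step change. The abstract does not give the exact definition, so the differencing scheme and the function names are assumptions.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2D list."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def dynamic_feature(attention_maps):
    """Frobenius norm of the change between consecutive attention maps.

    attention_maps: list of equally sized 2D lists, one per denoising step
    (assumed layout). Returns one value per pair of consecutive steps.
    """
    features = []
    for prev, curr in zip(attention_maps, attention_maps[1:]):
        diff = [[c - p for p, c in zip(prev_row, curr_row)]
                for prev_row, curr_row in zip(prev, curr)]
        features.append(frobenius_norm(diff))
    return features

# Toy 2x2 maps: a large change at the first step, none at the second.
maps = [[[0, 0], [0, 0]], [[3, 0], [0, 4]], [[3, 0], [0, 4]]]
trace = dynamic_feature(maps)
```

A detector built on such a trace would then compare it against traces from benign prompts. That comparison step is not shown here.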

Authors:Yichu Xu, Di Wang, Hongzan Jiao, Lefei Zhang, Liangpei Zhang
Title: MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification
Abstract:
Mamba-based models have recently demonstrated significant potential in hyperspectral image (HSI) classification, primarily due to their ability to perform contextual modeling with linear computational complexity. However, existing Mamba-based approaches often overlook the directional modeling heterogeneity across different land-cover types, leading to limited classification performance. To address these limitations, we propose MambaMoE, a novel spectral-spatial Mixture-of-Experts (MoE) framework, which represents the first MoE-based approach in the HSI classification domain. Specifically, we design a Mixture of Mamba Expert Block (MoMEB) that performs adaptive spectral-spatial feature modeling via a sparse expert activation mechanism. Additionally, we introduce an uncertainty-guided corrective learning (UGCL) strategy that encourages the model to focus on complex regions prone to prediction ambiguity. This strategy dynamically samples supervision signals from regions with high predictive uncertainty, guiding the model to adaptively refine feature representations and thereby enhancing its focus on challenging areas. Extensive experiments conducted on multiple public HSI benchmark datasets show that MambaMoE achieves state-of-the-art performance in both classification accuracy and computational efficiency compared to existing advanced methods, particularly Mamba-based ones. The code will be available online at https://github.com/YichuXu/MambaMoE.
中文: MambaMoE提出了一种新颖的光谱空间专家混合框架,通过自适应特征建模和不确定性引导的校正学习,在高光谱图像分类中实现了最优性能并具备卓越的计算效率。
English: MambaMoE introduces a novel spectral-spatial Mixture-of-Experts framework with adaptive feature modeling and uncertainty-guided corrective learning, achieving state-of-the-art performance in hyperspectral image classification with superior computational efficiency.

Authors:Elena Martinez, Beatrice Moscoloni, Matteo Salvador, Fanwei Kong, Mathias Peirlinck, Alison Lesley Marsden
Title: Full-field surrogate modeling of cardiac function encoding geometric variability
Abstract:
Combining physics-based modeling with data-driven methods is critical to enabling the translation of computational methods to clinical use in cardiology. The use of rigorous differential equations combined with machine learning tools allows for model personalization with uncertainty quantification in time frames compatible with clinical practice. However, accurate and efficient surrogate models of cardiac function, built from physics-based numerical simulation, are still mostly geometry-specific and require retraining for different patients and pathological conditions. We propose a novel computational pipeline to embed cardiac anatomies into full-field surrogate models. We generate a dataset of electrophysiology simulations using a complex multi-scale mathematical model coupling partial and ordinary differential equations. We adopt Branched Latent Neural Maps (BLNMs) as an effective scientific machine learning method to encode activation maps extracted from physics-based numerical simulations into a neural network. Leveraging large deformation diffeomorphic metric mappings, we build a biventricular anatomical atlas and parametrize the anatomical variability of a small and challenging cohort of 13 pediatric patients affected by Tetralogy of Fallot. We propose a novel statistical shape modeling based z-score sampling approach to generate a new synthetic cohort of 52 biventricular geometries that are compatible with the original geometrical variability. This synthetic cohort acts as the training set for BLNMs. Our surrogate model demonstrates robustness and great generalization across the complex original patient cohort, achieving an average adimensional mean squared error of 0.0034. The Python implementation of our BLNM model is publicly available under MIT License at https://github.com/StanfordCBCL/BLNM.
中文摘要:本研究提出一种将心脏解剖结构嵌入全场替代模型的计算流程,通过分支潜在神经映射和合成解剖采样,在法洛四联症儿科患者中实现了强大的泛化能力。
English Summary: This study introduces a computational pipeline that integrates cardiac anatomy into full-field surrogate models using Branched Latent Neural Maps, achieving robust generalization across pediatric patients with Tetralogy of Fallot through synthetic anatomical sampling.

Authors:Jiajun Ding, Beiyao Zhu, Xiaosheng Liu, Lishen Zhang, Zhao Liu
Title: LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
Abstract:
This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, addressing the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. Through systematic technical validation of the preprocessing pipeline, annotation quality and automatic segmentation algorithm, we verify that a deep learning model trained on this dataset segments lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which demonstrates the applicability and stability of the dataset for accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.
中文: 本研究通过整合PET和CT数据构建了三维多模态淋巴瘤分割数据集,基于深度学习实现病灶精准分割并推动精准医疗发展。
English: This study creates a 3D multimodal lymphoma segmentation dataset by integrating PET and CT data, enabling accurate lesion segmentation through deep learning and advancing precision medicine.

Authors:Derui Shan, Peng Guo, Wenshuo Li, Du Tao
Title: LPVIMO-SAM: Tightly-coupled LiDAR/Polarization Vision/Inertial/Magnetometer/Optical Flow Odometry via Smoothing and Mapping
Abstract:
We propose a tightly-coupled LiDAR/Polarization Vision/Inertial/Magnetometer/Optical Flow Odometry via Smoothing and Mapping (LPVIMO-SAM) framework, which integrates LiDAR, polarization vision, inertial measurement unit, magnetometer, and optical flow in a tightly-coupled fusion. This framework enables high-precision and highly robust real-time state estimation and map construction in challenging environments, such as LiDAR-degraded, low-texture regions, and feature-scarce areas. The LPVIMO-SAM comprises two subsystems: a Polarized Vision-Inertial System and a LiDAR/Inertial/Magnetometer/Optical Flow System. The polarized vision enhances the robustness of the Visual/Inertial odometry in low-feature and low-texture scenarios by extracting the polarization information of the scene. The magnetometer acquires the heading angle, and the optical flow obtains the speed and height to reduce the accumulated error. A magnetometer heading prior factor, an optical flow speed observation factor, and a height observation factor are designed to eliminate the cumulative errors of the LiDAR/Inertial odometry through factor graph optimization. Meanwhile, the LPVIMO-SAM can maintain stable positioning even when one of the two subsystems fails, further expanding its applicability in LiDAR-degraded, low-texture, and low-feature environments. Code is available on https://github.com/junxiaofanchen/LPVIMO-SAM.
中文: LPVIMO-SAM框架通过紧密耦合激光雷达、偏振视觉、惯性、磁力计和光流数据,并利用因子图优化,在激光雷达退化或低纹理等挑战性环境中实现高精度、高鲁棒性的实时状态估计与地图构建。
English: The LPVIMO-SAM framework integrates LiDAR, polarization vision, inertial, magnetometer, and optical flow data through tight coupling and factor graph optimization to achieve highly precise and robust real-time state estimation and mapping in challenging environments like LiDAR-degraded or low-texture areas.

Authors:Amaan Izhar, Nurul Japar, Norisma Idris, Ting Dang
Title: MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation
Abstract:
Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at https://github.com/AI-14/micar-vl-moe.
Chinese: 提出的MicarVLMoE模型通过门控交叉对齐融合架构解决了细粒度特征提取和多模态对齐的局限,在包括CT扫描和MRI在内的多种医学影像类型上实现了最先进的性能。
English: The proposed MicarVLMoE model overcomes limitations in fine-grained feature extraction and multimodal alignment through its gated cross-aligned fusion architecture, achieving state-of-the-art performance across diverse medical imaging types including CT scans and MRI.

Authors:Cedric Le Gentil, Leonardo Brizi, Daniil Lisus, Xinyuan Qiao, Giorgio Grisetti, Timothy D. Barfoot
Title: DRO: Doppler-Aware Direct Radar Odometry
Abstract:
A renaissance in radar-based sensing for mobile robotic applications is underway. Compared to cameras or lidars, millimetre-wave radars have the ability to 'see' through thin walls, vegetation, and adverse weather conditions such as heavy rain, fog, snow, and dust. In this paper, we propose a novel SE(2) odometry approach for spinning frequency-modulated continuous-wave radars. Our method performs scan-to-local-map registration of the incoming radar data in a direct manner using all the radar intensity information without the need for feature or point cloud extraction. The method performs locally continuous trajectory estimation and accounts for both motion and Doppler distortion of the radar scans. If the radar possesses a specific frequency modulation pattern that makes radial Doppler velocities observable, an additional Doppler-based constraint is formulated to improve the velocity estimate and enable odometry in geometrically feature-deprived scenarios (e.g., featureless tunnels). Our method has been validated on over 250 km of on-road data sourced from public datasets (Boreas and MulRan) and collected using our automotive platform. With the aid of a gyroscope, it outperforms state-of-the-art methods and achieves an average relative translation error of 0.26% on the Boreas leaderboard. When using data with the appropriate Doppler-enabling frequency modulation pattern, the translation error is reduced to 0.18% in similar environments. We also benchmarked our algorithm using 1.5 hours of data collected with a mobile robot in off-road environments with various levels of structure to demonstrate its versatility. Our real-time implementation is publicly available: https://github.com/utiasASRL/dro.
中文: 本文提出了一种用于旋转雷达的新型SE(2)里程计方法,通过直接扫描到地图配准和多普勒约束,在多种环境中实现了卓越的定位精度,并经过大量数据验证。
English: A novel SE(2) odometry method for spinning radars is introduced, using direct scan-to-map registration and Doppler constraints to achieve superior accuracy in diverse environments, as validated by extensive testing.

Authors:Junlin Guo, James R. Zimmer-Dauphinee, Jordan M. Nieusma, Siqi Lu, Quan Liu, Ruining Deng, Can Cui, Jialin Yue, Yizhe Lin, Tianyuan Yao, Juming Xiong, Junchao Zhu, Chongyu Qu, Yuechen Yang, Mitchell Wilkes, Xiao Wang, Parker VanValkenburgh, Steven A. Wernke, Yuankai Huo
Title: DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes
Abstract:
By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Codes will be available on https://github.com/geopacha/DeepAndes.
Chinese: 本文提出了DeepAndes,一种基于Transformer架构的视觉基础模型,通过在三百万张多光谱卫星图像上训练,并针对8波段影像优化自监督学习算法,在考古遥感任务中表现出色,显著优于现有模型在少样本场景下的性能。
English: This paper introduces DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, which achieves superior performance in archaeological remote sensing tasks by optimizing self-supervised learning for 8-band imagery and outperforming existing models in few-shot scenarios.

Authors:Stefan Kober
Title: Radius-Guided Post-Clustering for Shape-Aware, Scalable Refinement of k-Means Results
Abstract:
Traditional k-means clustering underperforms on non-convex shapes and requires the number of clusters k to be specified in advance. We propose a simple geometric enhancement: after standard k-means, each cluster center is assigned a radius (the distance to its farthest assigned point), and clusters whose radii overlap are merged. This post-processing step loosens the requirement for exact k: as long as k is overestimated (but not excessively), the method can often reconstruct non-convex shapes through meaningful merges. We also show that this approach supports recursive partitioning: clustering can be performed independently on tiled regions of the feature space, then globally merged, making the method scalable and suitable for distributed systems. Implemented as a lightweight post-processing step atop scikit-learn's k-means, the algorithm performs well on benchmark datasets, achieving high accuracy with minimal additional computation.
中文: 该方法通过为聚类中心分配半径并合并重叠簇来增强k均值算法,能够在k值被高估时重建非凸形状,同时通过递归分区实现可扩展的分布式计算。
English: The proposed method enhances k-means clustering by assigning radii to cluster centers and merging overlapping clusters, enabling reconstruction of non-convex shapes with overestimated k values while supporting scalable distributed computation through recursive partitioning.
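
The radius-guided merge step described in the k-means entry above can be sketched as a small post-processing function. This is a minimal illustration that takes the output of any k-means run (points, labels, centers) as plain Python lists. It assumes that two clusters "overlap" when the distance between their centers is at most the sum of their radii; the abstract does not spell out the exact test. The function name is hypothetical.

```python
import math

def merge_overlapping_clusters(points, labels, centers):
    """Merge k-means clusters whose radius balls overlap.

    points: list of coordinate tuples; labels: cluster index per point;
    centers: list of center tuples. Returns one merged label per point.
    """
    k = len(centers)
    # Radius = distance from each center to its farthest assigned point.
    radii = [0.0] * k
    for point, label in zip(points, labels):
        radii[label] = max(radii[label], math.dist(point, centers[label]))

    # Union-find over cluster indices, so chains of overlaps merge transitively.
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            # Assumed overlap test: the two balls intersect.
            if math.dist(centers[i], centers[j]) <= radii[i] + radii[j]:
                parent[find(j)] = find(i)

    return [find(label) for label in labels]

# Three clusters from an overestimated k; the first two overlap, the third is far away.
points = [(0, 0), (1, 0), (2, 0), (2.4, 0), (3, 0), (5.1, 0), (20, 0), (22, 0)]
labels = [0, 0, 0, 1, 1, 1, 2, 2]
centers = [(1, 0), (3.5, 0), (21, 0)]
merged = merge_overlapping_clusters(points, labels, centers)
```

Because the merge uses only centers and radii, the same function could be applied to clusters found independently on separate tiles, which matches the recursive partitioning idea in the abstract.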

Authors:Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, Sam Thomson
Title: MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
Abstract:
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
中文摘要:提出的MICE方法通过逐层相似性分析和概率分类计算内部置信度,显著提升了工具使用代理的安全性和实用性,在不同风险场景下均优于基线模型的校准精度和工具调用效能。
English Summary: The proposed MICE method enhances tool-using agents' safety and utility by computing internal confidence scores through layer-wise similarity analysis and probabilistic classification, outperforming baselines in calibration and tool-calling effectiveness across diverse scenarios.
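
The MICE pipeline described above has two stages: similarity features between each intermediate layer's decoded output and the final output, followed by a probabilistic classifier. The sketch below illustrates that shape only. It assumes the per-layer decodings are already available as strings, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and uses a logistic model with hand-set weights in place of a learned classifier. None of these choices is taken from the paper.

```python
import math
from difflib import SequenceMatcher

def layer_similarity_features(layer_generations, final_output):
    """One similarity score in [0, 1] per intermediate layer (stand-in metric)."""
    return [SequenceMatcher(None, generation, final_output).ratio()
            for generation in layer_generations]

def confidence(features, weights, bias):
    """Logistic model mapping similarity features to a confidence in (0, 1).

    weights and bias are hypothetical; in practice they would be fitted on
    labeled examples of correct and incorrect tool calls.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical decodings from two intermediate layers and the final output.
features = layer_similarity_features(["", "get_weather(city)"], "get_weather(city)")
score = confidence(features, weights=[0.0, 0.0], bias=0.0)
```

A calling policy could then invoke the tool only when the confidence exceeds a threshold chosen for the risk level of the action.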

Authors:Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang
Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
Abstract:
Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, from essay writing to mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and data can be accessed at: https://github.com/minnesotanlp/mpo
中文: 元策略优化(MPO)通过引入元奖励模型,在训练中动态优化奖励提示,有效应对奖励破解并减少对人工提示工程的依赖,同时在多样化任务中保持或超越手工设计提示的性能表现。
English: Meta Policy Optimization (MPO) introduces a meta-reward model that dynamically refines reward prompts during training, effectively combating reward hacking and reducing reliance on manual prompt engineering while maintaining or surpassing performance across diverse tasks.

Authors:Alireza Kazemi, Helia Rezvani, Mahsa Baktashmotlagh
Title: Benchmarking Transferability: A Framework for Fair and Robust Evaluation
Abstract:
Transferability scores aim to quantify how well a model trained on one domain generalizes to a target domain. Despite numerous methods proposed for measuring transferability, their reliability and practical usefulness remain inconclusive, often due to differing experimental setups, datasets, and assumptions. In this paper, we introduce a comprehensive benchmarking framework designed to systematically evaluate transferability scores across diverse settings. Through extensive experiments, we observe variations in how different metrics perform under various scenarios, suggesting that current evaluation practices may not fully capture each method's strengths and limitations. Our findings underscore the value of standardized assessment protocols, paving the way for more reliable transferability measures and better-informed model selection in cross-domain applications. Additionally, we achieved a 3.5% improvement using our proposed metric for the head-training fine-tuning experimental setup. Our code is available in this repository: https://github.com/alizkzm/pert_robust_platform.
中文: 本文提出了一个全面的基准测试框架,用于系统评估迁移性评分,发现不同指标在多种场景下表现各异,强调标准化评估协议对于提升跨领域模型选择可靠性的重要性。
English: This paper introduces a comprehensive benchmarking framework to systematically evaluate transferability scores, revealing variations in metric performance and advocating for standardized assessment protocols to enhance reliability in cross-domain model selection.

Authors:Zijie Lin, Yiqing Shen, Qilin Cai, He Sun, Jinrui Zhou, Mingjun Xiao
Title: AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers
Abstract:
Machine Learning (ML) research is spread through academic papers featuring rich multimodal content, including text, diagrams, and tabular results. However, translating these multimodal elements into executable code remains a challenging and time-consuming process that requires substantial ML expertise. We introduce "Paper-to-Code" (P2C), a novel task that transforms the multimodal content of scientific publications into fully executable code repositories, which extends beyond the existing formulation of code generation that merely converts textual descriptions into isolated code snippets. To automate the P2C process, we propose AutoP2C, a multi-agent framework based on large language models that processes both textual and visual content from research papers to generate complete code repositories. Specifically, AutoP2C contains four stages: (1) repository blueprint extraction from established codebases, (2) multimodal content parsing that integrates information from text, equations, and figures, (3) hierarchical task decomposition for structured code generation, and (4) iterative feedback-driven debugging to ensure functionality and performance. Evaluation on a benchmark of eight research papers demonstrates the effectiveness of AutoP2C, which can successfully generate executable code repositories for all eight papers, while OpenAI-o1 or DeepSeek-R1 can only produce runnable code for one paper. The code is available at https://github.com/shoushouyu/Automated-Paper-to-Code.
中文摘要:“论文到代码”(P2C)任务通过AutoP2C框架将机器学习研究论文中的多模态内容转化为可执行的代码库,其性能优于现有模型,成功为全部八篇测试论文生成了可运行代码。
English Summary: The "Paper-to-Code" (P2C) task transforms multimodal content from ML research papers into executable code repositories using the AutoP2C framework, which outperforms existing models by successfully generating functional code for all eight tested papers.

Authors:Zhonghao Li, Kunpeng Zhang, Jinghuai Ou, Shuliang Liu, Xuming Hu
Title: TreeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering
Abstract:
Retrieval-augmented generation (RAG) systems face significant challenges in multi-hop question answering (MHQA), where complex queries require synthesizing information across multiple document chunks. Existing approaches typically rely on iterative LLM-based query rewriting and routing, resulting in high computational costs due to repeated LLM invocations and multi-stage processes. To address these limitations, we propose TreeHop, an embedding-level framework without the need for LLMs in query refinement. TreeHop dynamically updates query embeddings by fusing semantic information from prior queries and retrieved documents, enabling iterative retrieval through embedding-space operations alone. This method replaces the traditional "Retrieve-Rewrite-Vectorize-Retrieve" cycle with a streamlined "Retrieve-Embed-Retrieve" loop, significantly reducing computational overhead. Moreover, a rule-based stop criterion is introduced to further prune redundant retrievals, balancing efficiency and recall rate. Experimental results show that TreeHop rivals advanced RAG methods across three open-domain MHQA datasets, achieving comparable performance with only 5%-0.4% of the model parameter size and reducing the query latency by approximately 99% compared to concurrent approaches. This makes TreeHop a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications. For reproducibility purposes, code and data are available here: https://github.com/allen-li1231/TreeHop-RAG.
中文: TreeHop是一种高效的嵌入级框架,通过语义融合动态更新查询嵌入,大幅降低计算成本和延迟,同时在多跳问答任务中保持优异性能。
English: TreeHop is an efficient embedding-level framework that streamlines multi-hop question answering by dynamically updating query embeddings through semantic fusion, significantly reducing computational costs and latency while maintaining competitive performance.
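
The "Retrieve-Embed-Retrieve" loop described in the abstract can be illustrated with a minimal sketch. This is not the TreeHop implementation: `toy_update` is a hypothetical placeholder for the learned embedding update, the "stop when the top chunk was already retrieved" rule is an assumed example of a rule-based stop criterion, and the 3-dimensional vectors are made-up toy embeddings. The point is only that each hop refines the query in embedding space, with no LLM call inside the loop.

```python
def dot(a, b):
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_top1(query, corpus):
    """Index of the chunk most similar to the query (first index wins ties)."""
    return max(range(len(corpus)), key=lambda i: dot(query, corpus[i]))

def toy_update(query, doc):
    """Hypothetical stand-in for a learned update: drop what the retrieved
    chunk already answers and pick up what it newly introduces."""
    return [abs(q - d) for q, d in zip(query, doc)]

def multi_hop_retrieve(query, corpus, max_hops=3):
    """Retrieve -> update embedding -> retrieve, with an assumed stop rule."""
    seen = []
    for _ in range(max_hops):
        best = retrieve_top1(query, corpus)
        if best in seen:  # assumed rule: stop on a redundant retrieval
            break
        seen.append(best)
        query = toy_update(query, corpus[best])
    return seen

# Toy axes (assumed): [film, director, birthplace]
corpus = [
    [1.0, 1.0, 0.0],  # chunk 0: the film and its director
    [0.0, 1.0, 1.0],  # chunk 1: the director and their birthplace
    [0.5, 0.0, 0.0],  # chunk 2: loosely related to the film
]
question = [1.0, 0.0, 0.5]  # asks about the film and a birthplace
path = multi_hop_retrieve(question, corpus)
print(path)
```

In this toy setup the first hop finds the chunk about the film, the updated query shifts toward the director, and the second hop finds the birthplace chunk before the stop rule ends the loop.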

Authors:Damien Martins Gomes
Title: Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis
Abstract:
First-order optimization methods remain the standard for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix. Despite the widespread adoption of first-order methods, second-order optimization algorithms often exhibit superior convergence compared to methods like Adam and SGD. However, their practicality in training DNNs is still limited by a significantly higher per-iteration computational cost compared to first-order methods. In this thesis, we present AdaFisher, a novel adaptive second-order optimizer that leverages a diagonal block-Kronecker approximation of the Fisher information matrix to adaptively precondition gradients. AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training DNNs. Despite the traditionally slower speed of second-order optimizers, AdaFisher is effective for tasks such as image classification and language modeling, exhibiting remarkable stability and robustness during hyperparameter tuning. We demonstrate that AdaFisher outperforms state-of-the-art optimizers in both accuracy and convergence speed. The code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
中文: AdaFisher是一种新型自适应二阶优化器,利用Fisher信息矩阵的对角块Kronecker近似来预处理梯度,旨在将二阶方法的优越收敛性与训练深度神经网络的计算效率相结合。
English: AdaFisher is a novel adaptive second-order optimizer that uses a diagonal block-Kronecker approximation of the Fisher information matrix to precondition gradients, aiming to combine the superior convergence of second-order methods with computational efficiency for training deep neural networks.
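
To make "preconditioning with a diagonal Kronecker-factored Fisher approximation" concrete, here is a minimal sketch for a single linear layer. It is a simplified illustration, not the AdaFisher algorithm: the function name, the damping form, and the use of plain per-unit second moments are assumptions. It relies only on the standard Kronecker-factored view in which the Fisher block of a layer is approximated by the product of an input-activation factor and an output-gradient factor; keeping only their diagonals means the entry for weight `W[i][j]` is scaled by `outgrad_sq[i] * act_sq[j]`.

```python
def diag_kron_precondition(grad, act_sq, outgrad_sq, damping=1e-3):
    """Scale grad[i][j] by 1 / (outgrad_sq[i] * act_sq[j] + damping).

    grad       : gradient of the loss w.r.t. W, shape [out][in]
    act_sq     : mean squared input activation per input unit (diagonal of A)
    outgrad_sq : mean squared back-propagated gradient per output unit (diagonal of G)
    damping    : assumed Tikhonov-style term that keeps the division stable
    """
    return [
        [g / (outgrad_sq[i] * act_sq[j] + damping) for j, g in enumerate(row)]
        for i, row in enumerate(grad)
    ]

grad = [[1.0, 2.0], [3.0, 4.0]]
act_sq = [1.0, 4.0]
outgrad_sq = [1.0, 0.25]
update = diag_kron_precondition(grad, act_sq, outgrad_sq, damping=0.0)
print(update)
```

Because both factors are diagonal, the preconditioner costs one multiply and one divide per weight, which is the same order of work as Adam's diagonal scaling.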

Authors:Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Yasunori Ishii, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli
Title: VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
Abstract:
Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.
Chinese: VideoMultiAgents框架通过整合视觉、场景图和文本处理的专业代理,并辅以问题引导的标题生成,有效提升了视频理解能力,在多个基准测试中实现了最先进的性能。
English: The VideoMultiAgents framework overcomes limitations in capturing temporal and interactive contexts by integrating specialized agents for multimodal reasoning and question-guided caption generation, achieving state-of-the-art performance across multiple benchmarks.

Authors:Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
Title: RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Abstract:
Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap, marked by reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
中文摘要:本研究提出了用于多轮强化学习训练大语言模型智能体的StarPO框架和RAGEN系统,揭示了训练中的回声陷阱等挑战,强调需要采用稳定化训练方法和细粒度、面向推理的奖励信号来促进智能体的有效推理能力。
English Summary: The study introduces the StarPO framework and the RAGEN system for training LLM agents through multi-turn reinforcement learning, revealing challenges like Echo Trap and emphasizing the need for stabilized training methods and fine-grained, reasoning-aware reward signals to foster effective agent reasoning.

Authors:Moto Hira, Christian Puhrsch, Valentin Andrei, Roman Malinovskyy, Gael Le Lan, Abhinandan Krishnan, Joseph Cummings, Miguel Martin, Gokul Gunasekaran, Yuta Inoue, Alex J Turner, Raghuraman Krishnamoorthi
Title: Scalable and Performant Data Loading
Abstract:
We present SPDL (Scalable and Performant Data Loading), an open-source, framework-agnostic library designed for efficiently loading array data to GPU. Data loading is often a bottleneck in AI applications, and is challenging to optimize because it requires coordination of network calls, CPU-bound tasks, and GPU device transfer. On top of that, Python's GIL (Global Interpreter Lock) makes it difficult to gain performance improvement from multi-threading. We found that when data preprocessing functions release the GIL entirely, it is possible to execute them concurrently in a thread pool, thereby improving the workflow performance. Our benchmark shows that compared to the PyTorch DataLoader, SPDL can iterate through the ImageNet dataset 74% faster while using 38% less CPU and 50GB less memory. When training a ViT-B/16 model, SPDL can send data to the GPU at a speed that does not starve the training. Additionally, when using SPDL on Python 3.13t, without changing any code, the throughput is further improved by 33%, thanks to the disabled GIL. SPDL can improve the performance of current AI model training, and receives further performance improvements when Free-Threaded Python is adopted in production systems. SPDL is available at https://github.com/facebookresearch/spdl.
中文摘要:SPDL是一个开源、框架无关的GPU数据加载库,通过线程池并行执行绕过Python全局解释器锁限制,在ImageNet数据集上比PyTorch DataLoader迭代速度快74%且资源消耗更低。
English Summary: SPDL is an open-source, framework-agnostic library for loading array data to the GPU that works around Python's GIL by running GIL-releasing preprocessing functions in a thread pool, achieving 74% faster ImageNet iteration and reduced resource usage compared to PyTorch DataLoader.
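
The core observation, that preprocessing functions which release the GIL can run concurrently in a thread pool, can be sketched with the standard library alone. This example does not use the SPDL API; `decode` and `load_batch` are hypothetical names, and `zlib.decompress` merely stands in for a real decoding step because CPython's `zlib` releases the GIL while it decompresses.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def decode(blob: bytes) -> bytes:
    """Stand-in preprocessing step; the heavy work runs outside the GIL."""
    return zlib.decompress(blob)

def load_batch(blobs, num_threads=4):
    """Decode all blobs in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode, blobs))

# Synthetic "dataset": eight compressed payloads of 100 kB each.
raw = [bytes([i]) * 100_000 for i in range(8)]
blobs = [zlib.compress(r) for r in raw]
decoded = load_batch(blobs)
print(len(decoded), all(d == r for d, r in zip(decoded, raw)))
```

A pure-Python `decode` would hold the GIL for its whole duration, so the same thread pool would give little or no speedup on a standard (non-free-threaded) interpreter.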

Authors:Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, Marc Pollefeys
Title: MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
Abstract:
While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at https://github.com/cvg/mpsfm.
中文: 本研究通过整合基于深度学习的单目深度和法线先验,克服了运动恢复结构在极端视角变化等挑战性场景中的局限,实现了即便使用少量图像也能进行鲁棒三维重建。
English: This research overcomes Structure-from-Motion's limitations in challenging scenarios by integrating deep learning-based monocular depth and normal priors, achieving robust reconstruction even with extreme viewpoint changes and few images.

Authors:Sahel Sharifymoghaddam, Shivani Upadhyay, Nandan Thakur, Ronak Pradeep, Jimmy Lin
Title: Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
Abstract:
Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach for assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluate the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a "good" response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations. All the code necessary to reproduce results in our work is available in https://github.com/castorini/lmsys_nuggetize.
中文摘要:竞技场评估虽广泛用于LLM和RAG系统,但缺乏解释性和诊断性,而AutoNuggetizer框架通过原子事实分解显示金块评分与人类偏好显著相关,为可解释诊断评估提供了新途径。
English Summary: Battles are popular for evaluating LLMs and RAG systems but lack explanatory and diagnostic capabilities, whereas the AutoNuggetizer framework demonstrates that nugget evaluation correlates with human preferences, offering a promising explainable and diagnostic alternative.

Authors:Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Title: Shopformer: Transformer-Based Framework for Detecting Shoplifting via Human Pose
Abstract:
Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis, which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving and scalable solution for real-time retail surveillance. The code base for this work is available at https://github.com/TeCSAR-UNCC/Shopformer.
中文摘要:Shopformer提出了一种基于姿态序列分析的Transformer模型,通过将动作轨迹转换为紧凑嵌入来检测商店盗窃,在保护隐私的同时显著提升了检测效率与实时监控能力。
English Summary: Shopformer introduces a privacy-focused transformer model that detects shoplifting through pose sequence analysis instead of raw video, outperforming existing methods while addressing computational and surveillance limitations.

Authors:Yunfei Wan, Jianheng Liu, Chunran Zheng, Jiarong Lin, Fu Zhang
Title: Mesh-Learner: Texturing Mesh with Spherical Harmonics
Abstract:
In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh's view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH Texels at each pixel's sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH Texels in SH textures. Mesh-Learner exploits graphics features of the rasterization pipeline (texture sampling, deferred rendering) to render, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train vast, unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit the community, the code will be available at https://github.com/hku-mars/Mesh-Learner.
中文: 本文提出Mesh-Learner框架,通过结合网格和球谐纹理进行端到端视角相关辐射度学习,在保持与传统光栅化流程兼容的同时,在多个数据集上实现了最先进的渲染性能。
English: This paper introduces Mesh-Learner, a 3D reconstruction framework that integrates mesh and spherical harmonic textures for end-to-end view-dependent radiance learning, achieving state-of-the-art performance while maintaining compatibility with traditional rasterization pipelines and tools like Blender.

Authors:Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Title: DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images
Abstract:
This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
中文摘要:DeeCLIP是一种利用CLIP-ViT和融合学习检测AI生成图像的创新框架,通过特征融合模块和三元组损失增强模型鲁棒性,仅用少量训练数据即在多种生成模型测试中展现优异性能。
English Summary: DeeCLIP is a robust framework that detects AI-generated images by integrating CLIP-ViT with fusion learning and triplet loss, achieving high accuracy and generalization across multiple generative models despite limited training data.

Authors:Andre Schreiber, Katherine Driggs-Campbell
Title: Do You Know the Way? Human-in-the-Loop Understanding for Fast Traversability Estimation in Mobile Robotics
Abstract:
The increasing use of robots in unstructured environments necessitates the development of effective perception and navigation strategies to enable field robots to successfully perform their tasks. In particular, it is key for such robots to understand where in their environment they can and cannot travel -- a task known as traversability estimation. However, existing geometric approaches to traversability estimation may fail to capture nuanced representations of traversability, whereas vision-based approaches typically either involve manually annotating a large number of images or require robot experience. In addition, existing methods can struggle to address domain shifts as they typically do not learn during deployment. To this end, we propose a human-in-the-loop (HiL) method for traversability estimation that prompts a human for annotations as-needed. Our method uses a foundation model to enable rapid learning on new annotations and to provide accurate predictions even when trained on a small number of quickly-provided HiL annotations. We extensively validate our method in simulation and on real-world data, and demonstrate that it can provide state-of-the-art traversability prediction performance.
中文: 所提出的人机交互方法利用基础模型,通过少量人工标注实现高效的可通行性评估,在仿真和实际场景中均达到了领先性能。
English: The proposed human-in-the-loop method leverages a foundation model to enable efficient traversability estimation through minimal human annotations, achieving state-of-the-art performance in both simulation and real-world scenarios.

Authors:Kyo Gerrits, Ana Guerberof-Arenas
Title: To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels
Abstract:
This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at https://github.com/INCREC/Pilot_to_MT_or_not_to_MT
中文摘要:该试点研究表明,翻译中的创意元素会提高认知负荷,人工翻译中最为显著,机器翻译中最低,而错误无此影响,且认知负荷增加可能提升读者的阅读乐趣和沉浸感。
English summary: This pilot study reveals that creative elements in translations increase cognitive load most in human translations and least in machine translations, while errors show no effect, with higher cognitive load potentially enhancing reader enjoyment and immersion.

Authors:Yulong Guo, Zilun Zhang, Yongheng Shang, Tiancheng Zhao, Shuiguang Deng, Yingchun Yang, Jianwei Yin
Title: SRMF: A Data Augmentation and Multimodal Fusion Approach for Long-Tail UHR Satellite Image Segmentation
Abstract:
The long-tail problem presents a significant challenge to the advancement of semantic segmentation in ultra-high-resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi-branch network architectures that emphasize multi-scale feature extraction and fusion, they have often overlooked the importance of addressing the long-tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long-tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long-tail class distribution by incorporating a multi-scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion-based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33%, 0.66%, and 0.98%, respectively, achieving state-of-the-art performance. Code is available at: https://github.com/BinSpa/SRMF.git.
中文摘要:SRMF框架通过多尺度裁剪、语义重排序数据增强以及文本-视觉多模态融合,解决了超高分辨率卫星图像分割中的长尾分布问题,在多个数据集上实现了最优性能。
English Summary: The SRMF framework tackles the long-tail problem in ultra-high-resolution satellite image segmentation through multi-scale cropping, semantic reordering data augmentation, and multimodal text-visual fusion, achieving state-of-the-art performance on multiple datasets.

Authors:Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li
Title: LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Abstract:
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
中文摘要:本文系统综述了大型语言模型驱动的手机图形界面代理从脚本自动化向智能自适应系统的演进,通过先进语言理解解决核心挑战,提出涵盖框架、方法和数据集的分类体系,并指出未来研究方向。
English Summary: This paper systematically reviews the evolution of LLM-driven phone GUI agents from script-based systems to intelligent adaptive solutions, addressing key challenges through advanced language understanding and proposing comprehensive frameworks while highlighting future research directions.

Authors:Xiaoyu Liu, Mingshuai Yao, Yabo Zhang, Xianhui Lin, Peiran Ren, Xiaoming Li, Ming Liu, Wangmeng Zuo
Title: AnimateAnywhere: Rouse the Background in Human Image Animation
Abstract:
Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.
中文摘要:AnimateAnywhere框架通过背景运动学习器从人体姿态序列中推断背景运动,无需预设相机轨迹即可生成具有生动逼真背景的人物动画,实现了最先进的性能。
English Summary: The AnimateAnywhere framework introduces a background motion learner that generates dynamic and realistic backgrounds in human image animations by inferring motion from human pose sequences, eliminating the need for predefined camera trajectories.

Authors:Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu
Title: Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
Abstract:
Neural Radiance Fields (NeRF) has demonstrated its superior capability to represent 3D geometry but requires accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and Scannet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at https://github.com/HoangChuongNguyen/cope-nerf.
中文: 该方法将连续相机运动建模为时间依赖的速度,无需依赖姿态初始化或深度先验,实现了更优的相机姿态与深度估计,同时保持了具有竞争力的新视角合成效果。
English: The proposed method models continuous camera motions as time-dependent velocities to eliminate dependencies on pose initialization or depth priors, enabling superior camera pose and depth estimation while maintaining competitive novel-view synthesis performance.

Authors:Nicola Debole, Pietro Barbiero, Francesco Giannini, Andrea Passerini, Stefano Teso, Emanuele Marconato
Title: If Concept Bottlenecks are the Question, are Foundation Models the Answer?
Abstract:
Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. CBMs work by first mapping inputs (e.g., images) to high-level concepts (e.g., visible objects and their properties) and then using these to solve a downstream task (e.g., tagging or scoring an image) in an interpretable manner. Their performance and interpretability, however, hinge on the quality of the concepts they learn. The go-to strategy for ensuring good quality concepts is to leverage expert annotations, which are expensive to collect and seldom available in applications. Researchers have recently addressed this issue by introducing "VLM-CBM" architectures that replace manual annotations with weak supervision from foundation models. It is, however, unclear how doing so affects the quality of the learned concepts. To answer this question, we put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics. Our results show that, depending on the task, VLM supervision can differ noticeably from expert annotations, and that concept accuracy and quality are not strongly correlated. Our code is available at https://github.com/debryu/CQA.
Chinese: 概念瓶颈模型通过将输入映射到概念来增强可解释性,但其效果依赖于概念质量;实验表明,基础模型提供的弱监督可能与专家标注存在明显差异(因任务而异),且概念准确性与概念质量之间并无强相关性。
English: Concept Bottleneck Models (CBMs) enhance interpretability by mapping inputs to concepts, but their effectiveness depends on concept quality; an empirical study of VLM-CBMs shows that, depending on the task, weak supervision from foundation models can differ noticeably from expert annotations, and that concept accuracy and concept quality are not strongly correlated.

Authors:Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Title: STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction
Abstract:
3D occupancy and scene flow offer a detailed and dynamic representation of a 3D scene. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel method for modeling long-term dynamic interactions, which reduces computational costs and preserves spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and mAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to 8.7GB. Our code is available at https://github.com/lzzzzzm/STCOcc
中文: 本文提出了一种基于显式状态的建模方法,通过稀疏遮挡感知注意力和级联优化策略改进三维特征重建,在占用和场景流预测中实现了更优性能并显著降低了GPU内存消耗。
English: This paper introduces an explicit state-based modeling method with sparse occlusion-aware attention and cascade refinement to enhance 3D feature renovation for occupancy and scene flow prediction, achieving superior performance and reduced GPU memory usage.

Authors:Valerie Zermatten, Javiera Castillo-Navarro, Pallavi Jain, Devis Tuia, Diego Marcos
Title: EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia
Abstract:
The presence of species provides key insights into the ecological properties of a location such as land cover, climatic conditions or even soil properties. We propose a method to predict such ecological properties directly from remote sensing (RS) images by aligning them with species habitat descriptions. We introduce the EcoWikiRS dataset, consisting of high-resolution aerial images, the corresponding geolocated species observations, and, for each species, the textual descriptions of their habitat from Wikipedia. EcoWikiRS offers a scalable way of supervision for RS vision language models (RS-VLMs) for ecology. This is a setting with weak and noisy supervision, where, for instance, some text may describe properties that are specific only to part of the species' niche or is irrelevant to a specific image. We tackle this by proposing WINCEL, a weighted version of the InfoNCE loss. We evaluate our model on the task of ecosystem zero-shot classification by following the habitat definitions from the European Nature Information System (EUNIS). Our results show that our approach helps in understanding RS images in a more ecologically meaningful manner. The code and the dataset are available at https://github.com/eceo-epfl/EcoWikiRS.
中文: 本研究提出EcoWikiRS数据集,整合了航空图像与物种栖息地描述,并开发了WINCEL加权损失方法,能在弱监督条件下从遥感图像预测生态属性,有效提升了零样本生态系统分类等任务的生态学意义解读能力。
English: The study introduces EcoWikiRS, a dataset combining aerial images with species habitat descriptions, and proposes WINCEL, a weighted loss method to predict ecological properties from remote sensing images despite noisy supervision, enhancing ecological understanding in tasks like zero-shot ecosystem classification.
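
The abstract describes WINCEL only as a weighted version of the InfoNCE loss. The sketch below illustrates that general idea and nothing more: the per-pair weights, the function name `weighted_info_nce`, and the temperature value are assumptions made for illustration, not the authors' formulation.

```python
import math


def weighted_info_nce(sim, weights, temperature=0.1):
    """Weighted InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between image i and text j; the positive
    pair for row i is (i, i). weights[i] is a hypothetical non-negative
    weight expressing how much pair i is trusted.
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += weights[i] * (log_denominator - logits[i])
    return total / weight_sum
```

With uniform weights this reduces to the usual InfoNCE average; lowering the weight of a pair whose text poorly matches the image reduces that pair's influence on the loss.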

Authors:Yonghui Zhai, Yang Zhang, Minghao Shang, Lihua Pang, Yaxin Ren
Title: Graph Fourier Transformer with Structure-Frequency Information
Abstract:
Graph Transformers (GTs) have shown advantages in numerous graph structure tasks, but their self-attention mechanism ignores the generalization bias of graphs. Existing methods mainly compensate for this bias through position encoding, attention bias, and relative distance, yet they remain sub-optimal because they consider generalization bias only from a structural perspective. To address this, this paper proposes Grafourierformer, which combines GT with an inductive bias containing frequency-structure information by applying the Graph Fourier Transform to the attention matrix. Specifically, eigenvalues of the graph Laplacian matrix are used to construct an eigenvalue matrix mask, which reflects node positions and structural relationships with neighboring nodes so that node range structural characteristics are considered and local graph details receive focus. The inverse Fourier transform is then employed to extract high-frequency and low-frequency node features, calculate low-frequency and high-frequency energy, and construct a node frequency-energy matrix that filters the eigenvalue matrix mask. This allows attention heads to incorporate both graph structural information and node frequency information, adaptively distinguish global trends from local details, and effectively suppress redundant information interference. Extensive experiments on various benchmarks show that Grafourierformer consistently outperforms GNN and GT-based models in graph classification and node classification tasks, and ablation experiments further validate the effectiveness and necessity of the method. Code is available at https://github.com/Arichibald/Grafourierformer.git
中文: 本文提出Grafourierformer,通过图傅里叶变换将频率结构归纳偏置创新性融入图Transformer的注意力机制,同时考虑图结构关系与节点频率信息,在图分类和节点分类任务中实现了更优性能。
English: This paper introduces Grafourierformer, a novel Graph Transformer that integrates frequency-structure inductive bias through Graph Fourier Transform to enhance attention mechanisms by combining structural relationships with node frequency information, achieving superior performance in graph and node classification tasks.

Authors:Yingbin Bai, Sylvie Thiebaux, Felipe Trevizan
Title: Learning Efficiency Meets Symmetry Breaking
Abstract:
Learning-based planners leveraging Graph Neural Networks can learn search guidance applicable to large search spaces, yet their potential to address symmetries remains largely unexplored. In this paper, we introduce a graph representation of planning problems allying learning efficiency with the ability to detect symmetries, along with two pruning methods, action pruning and state pruning, designed to manage symmetries during search. The integration of these techniques into Fast Downward achieves a first-time success over LAMA on the latest IPC learning track dataset. Code is released at: https://github.com/bybeye/Distincter.
中文: 本文提出了一种基于图的规划表示方法,既能高效学习又能检测对称性,结合剪枝技术集成到Fast Downward后,在最新IPC数据集上首次超越了LAMA。
English: This paper introduces a graph-based planning representation that enables efficient learning and symmetry detection, along with pruning methods that, when integrated into Fast Downward, outperform LAMA on the latest IPC dataset.

Authors:Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry, Loic Landrieu
Title: CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis
Abstract:
Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts' similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods. Our code is publicly available at https://github.com/Abhishek19009/CoDEx.
中文: 针对卫星图像中地形变化带来的挑战,提出了一种新的领域泛化框架,通过训练多个专家模型并聚合其预测,在四个数据集上均取得了优于现有方法的性能提升。
English: A novel domain-generalization framework for satellite images trains multiple expert models and aggregates their predictions to overcome terrain variation challenges, demonstrating consistent improvements across four datasets.
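
The selection-and-aggregation step described in the abstract can be sketched roughly as follows. The relevance scores, the `top_k` cut-off, and the weighted-average rule are illustrative assumptions; the abstract does not specify how the selection module scores experts or combines their outputs.

```python
def aggregate_experts(expert_probs, relevance, top_k=2):
    """Combine the predictions of the most suitable experts for one sample.

    expert_probs[e] is expert e's class-probability list for the sample.
    relevance[e] is a hypothetical positive suitability score that a
    selection module would assign to expert e for this sample.
    """
    ranked = sorted(range(len(relevance)), key=lambda e: relevance[e], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(relevance[e] for e in chosen)
    n_classes = len(expert_probs[0])
    # Relevance-weighted average of the chosen experts' probabilities.
    return [
        sum(relevance[e] * expert_probs[e][c] for e in chosen) / norm
        for c in range(n_classes)
    ]
```

Because each expert outputs a probability distribution and the weights are normalised, the aggregated prediction is again a distribution.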

Authors:Shengjian Fang, Yixuan Zhou, Yu Zheng, Pengyu Jiang, Siyuan Liu, Hesheng Wang
Title: UTTG: A Universal Teleoperation Approach via Online Trajectory Generation
Abstract:
Teleoperation is crucial for hazardous environment operations and serves as a key tool for collecting expert demonstrations in robot learning. However, existing methods face robotic hardware dependency and control frequency mismatches between teleoperation devices and robotic platforms. Our approach automatically extracts kinematic parameters from unified robot description format (URDF) files, and enables pluggable deployment across diverse robots through uniform interfaces. The proposed interpolation algorithm bridges the frequency gap between low-rate human inputs and high-frequency robotic control commands through online continuous trajectory generation, while requiring no access to the closed, bottom-level control loop. To enhance trajectory smoothness, we introduce a minimum-stretch spline that optimizes the motion quality. The system further provides precision and rapid modes to accommodate different task requirements. Experiments across various robotic platforms, including dual-arm ones, demonstrate the generality and smooth operation of our method. The code is developed in C++ with a Python interface and is available at https://github.com/IRMV-Manipulation-Group/UTTG.
中文: 本方法通过自动提取URDF文件中的运动学参数,利用在线轨迹生成技术解决控制频率不匹配问题,并通过优化样条曲线提升运动平滑度,同时提供双操作模式以适应不同任务需求,实现了跨机器人的即插即用遥操作。
English: Our method enables pluggable teleoperation across diverse robots by automatically extracting kinematic parameters from URDF files and bridging control frequency gaps through online trajectory generation, while enhancing motion smoothness with optimized splines and offering dual operation modes for different tasks.

Authors:Nikolaos Chaidos, Angeliki Dimitriou, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Title: Explaining Vision GNNs: A Semantic and Visual Analysis of Graph-based Image Classification
Abstract:
Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers' robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models' explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.
中文: 本研究通过分析图神经网络各层语义一致性和可视化信息流动,评估其在图像分类中的可解释性,发现虽然模型决策可被有效解释,但其深层推理逻辑与人类感知存在差异。
English: This study evaluates the explainability of Graph Neural Networks (GNNs) in image classification by analyzing semantic consistency across layers and visualizing information flow, revealing that while model decisions are interpretable, their reasoning diverges from human perception in deeper layers.

Authors:Biqing Duan, Qing Wang, Di Liu, Wei Zhou, Zhenli He, Shengfa Miao
Title: LODAP: On-Device Incremental Learning Via Lightweight Operations and Data Pruning
Abstract:
Incremental learning that learns new classes over time after the model's deployment is becoming increasingly crucial, particularly for industrial edge systems, where it is difficult to communicate with a remote server to conduct computation-intensive learning and where more classes are expected to be learned after deployment. In this paper, we propose LODAP, a new on-device incremental learning framework for edge systems. The key part of LODAP is a new module, namely Efficient Incremental Module (EIM). EIM is composed of normal convolutions and lightweight operations. During incremental learning, EIM exploits some lightweight operations, called adapters, to effectively and efficiently learn features for new classes so that it can improve the accuracy of incremental learning while reducing model complexity as well as training overhead. The efficiency of LODAP is further enhanced by a data pruning strategy that significantly reduces the training data, thereby lowering the training overhead. We conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets. Experimental results show that LODAP improves the accuracy by up to 4.32% over existing methods while reducing around 50% of model complexity. In addition, evaluations on real edge systems demonstrate its applicability for on-device machine learning. The code is available at https://github.com/duanbiqing/LODAP.
中文: LODAP是一种设备端增量学习框架,通过高效增量模块和数据剪枝策略,在提高精度的同时显著降低了模型复杂度和训练开销。
English: LODAP is an on-device incremental learning framework that enhances accuracy while reducing model complexity and training overhead through its Efficient Incremental Module and data pruning strategy.

Authors:Haroui Ma, Francesco Quinzan, Theresa Willem, Stefan Bauer
Title: AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis
Abstract:
Machine learning (ML) systems for medical imaging have demonstrated remarkable diagnostic capabilities, but their susceptibility to biases poses significant risks, since biases may negatively impact generalization performance. In this paper, we introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. Our method leverages the concept of counterfactual invariance, measuring the extent to which a model's predictions remain unchanged under hypothetical changes to sensitive attributes. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases without requiring direct access to counterfactual data. Through experiments on synthetic datasets and large-scale real-world medical imaging datasets, including CheXpert and MIMIC-CXR, we demonstrate that our approach aligns closely with counterfactual fairness principles and outperforms standard baselines. This work provides a robust tool to ensure that ML diagnostic systems generalize well, e.g., across demographic groups, offering a critical step towards AI safety in healthcare. Code: https://github.com/Neferpitou3871/AI-Alignment-Medical-Imaging.
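中文: 本文提出了一种基于反事实不变性的新型统计框架,用于识别和量化医学影像机器学习模型对人口统计等敏感属性的依赖;该方法结合条件潜在扩散模型与统计假设检验,在合成数据集以及CheXpert、MIMIC-CXR等真实数据集上的实验表明,其结果与反事实公平性原则高度一致,并优于标准基线。
English: This paper introduces a statistical framework based on counterfactual invariance that identifies and quantifies the dependence of medical imaging ML models on sensitive attributes such as demographics, combining conditional latent diffusion models with hypothesis testing; experiments on synthetic data and on CheXpert and MIMIC-CXR show that it aligns closely with counterfactual fairness principles and outperforms standard baselines.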

Authors:Kitsuya Azuma, Takayuki Nishio, Yuichi Kitagawa, Wakako Nakano, Takahito Tanimura
Title: Soft-Label Caching and Sharpening for Communication-Efficient Federated Distillation
Abstract:
Federated Learning (FL) enables collaborative model training across decentralized clients, enhancing privacy by keeping data local. Yet conventional FL, relying on frequent parameter-sharing, suffers from high communication overhead and limited model heterogeneity. Distillation-based FL approaches address these issues by sharing predictions (soft-labels) instead, but they often involve redundant transmissions across communication rounds, reducing efficiency. We propose SCARLET, a novel framework integrating synchronized soft-label caching and an enhanced Entropy Reduction Aggregation (Enhanced ERA) mechanism. SCARLET minimizes redundant communication by reusing cached soft-labels, achieving up to 50% reduction in communication costs compared to existing methods while maintaining accuracy. Enhanced ERA can be tuned to adapt to non-IID data variations, ensuring robust aggregation and performance in diverse client scenarios. Experimental evaluations demonstrate that SCARLET consistently outperforms state-of-the-art distillation-based FL methods in terms of accuracy and communication efficiency. The implementation of SCARLET is publicly available at https://github.com/kitsuyaazuma/SCARLET.
中文: SCARLET是一种新颖的联邦学习框架,通过同步软标签缓存和增强的熵减聚合机制,在保持不同数据分布下高精度的同时,将通信成本降低了高达50%。
English: SCARLET is a novel federated learning framework that reduces communication costs by up to 50% through synchronized soft-label caching and an enhanced entropy reduction aggregation mechanism, while maintaining high accuracy across diverse data distributions.
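
The two ideas named in the abstract, reusing cached soft-labels and reducing the entropy of aggregated predictions, can be illustrated with the rough sketch below. The expiry rule (`max_age`), the power-based sharpening formula, and all names are assumptions for illustration; the abstract does not give SCARLET's actual cache policy or the Enhanced ERA formula.

```python
def sharpen(probs, temperature=0.5):
    """Return a lower-entropy distribution: p_i ** (1 / T), renormalised.

    A temperature below 1 sharpens; a temperature of 1 leaves probs unchanged.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


class SoftLabelCache:
    """Keeps soft-labels per sample so they need not be re-sent every round."""

    def __init__(self, max_age):
        self.max_age = max_age  # hypothetical: rounds before an entry expires
        self._store = {}        # sample_id -> (soft_label, round_stored)

    def missing(self, sample_ids, current_round):
        """Sample ids whose soft-labels must be transmitted this round."""
        return [
            s for s in sample_ids
            if s not in self._store
            or current_round - self._store[s][1] >= self.max_age
        ]

    def update(self, fresh_labels, current_round):
        for sample_id, label in fresh_labels.items():
            self._store[sample_id] = (label, current_round)

    def get(self, sample_id):
        return self._store[sample_id][0]
```

In this sketch, only the ids returned by `missing` would be sent over the network in a given round; everything else is read from the cache.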

Authors:Seongmin Hwang, Daeyoung Han, Moongu Jeon
Title: DG-DETR: Toward Domain Generalized Detection Transformer
Abstract:
End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.
中文: DG-DETR通过提出领域无关查询选择和小波分解的即插即用方法,有效提升了DETR检测器在分布外场景下的鲁棒性,填补了Transformer检测器在领域泛化研究中的空白。
English: DG-DETR introduces a plug-and-play method using domain-agnostic query selection and wavelet decomposition to enhance DETRs' out-of-distribution robustness, effectively addressing domain generalization gaps in transformer-based detectors.
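
One way to read "removing domain-induced biases via orthogonal projection" is to subtract from a query vector its component along a style direction. The sketch below shows only that generic linear-algebra step; the single style vector and the function name are assumptions, and the paper's instance-specific style space may be constructed quite differently.

```python
def remove_direction(query, style):
    """Subtract from `query` its projection onto the `style` direction.

    The result is orthogonal to `style`. Both arguments are plain lists of
    floats of equal length; `style` is a hypothetical style direction.
    """
    norm_sq = sum(s * s for s in style)
    if norm_sq == 0:
        return list(query)  # nothing to remove along a zero vector
    scale = sum(q * s for q, s in zip(query, style)) / norm_sq
    return [q - scale * s for q, s in zip(query, style)]
```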

Authors:Yasir Ghunaim, Andrés Villa, Gergo Ignacz, Gyorgy Szekely, Motasem Alfarra, Bernard Ghanem
Title: Towards Faster and More Compact Foundation Models for Molecular Property Prediction
Abstract:
Advancements in machine learning for molecular property prediction have improved accuracy but at the expense of higher computational cost and longer training times. Recently, the Joint Multi-domain Pre-training (JMP) foundation model has demonstrated strong performance across various downstream tasks with reduced training time over previous models. Despite JMP's advantages, fine-tuning it on molecular datasets ranging from small-scale to large-scale requires considerable time and computational resources. In this work, we investigate strategies to enhance efficiency by reducing model size while preserving performance. To better understand the model's efficiency, we analyze the layer contributions of JMP and find that later interaction blocks provide diminishing returns, suggesting an opportunity for model compression. We explore block reduction strategies by pruning the pre-trained model and evaluating its impact on efficiency and accuracy during fine-tuning. Our analysis reveals that removing two interaction blocks results in a minimal performance drop, reducing the model size by 32% while increasing inference throughput by 1.3x. These results suggest that JMP-L is over-parameterized and that a smaller, more efficient variant can achieve comparable performance with lower computational cost. Our study provides insights for developing lighter, faster, and more scalable foundation models for molecular and materials discovery. The code is publicly available at: https://github.com/Yasir-Ghunaim/efficient-jmp.
中文: 研究表明,联合多领域预训练(JMP)模型存在参数冗余,通过剪除两个交互模块可在保持性能基本不变的同时实现模型体积减小32%、推理速度提升1.3倍,为分子发现提供了更高效的基础模型解决方案。
English: This study demonstrates that the Joint Multi-domain Pre-training (JMP) model is over-parameterized and can be compressed by removing two interaction blocks, achieving a 32% size reduction and 1.3x faster inference with minimal performance loss, offering a more efficient foundation model for molecular discovery.

Authors:Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang
Title: LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning
Abstract:
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8% and 11.1% of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing. Code to reproduce the experiment is available at https://github.com/LilaKen/LR-IAD.
中文: 本研究提出了一种无需掩码标注的推理框架和动态奖励函数,以解决工业异常检测中的类别不平衡问题,在MVTec-AD和VisA数据集上分别实现了36%和16%的准确率提升,同时提供可解释的缺陷定位,显著降低了实施成本。
English: This study introduces a mask-free reasoning framework with a dynamic reward function to address class imbalance in industrial anomaly detection, achieving state-of-the-art accuracy improvements of 36% on MVTec-AD and 16% on VisA while providing interpretable defect localization without costly mask annotations.

Authors:Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, Yu Wang
Title: Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
Abstract:
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency becomes an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, which utilizes a novel signaling mechanism: when part of the output finishes, the computation kernel sends a signal to trigger the communication of that part, while continuing the computation of the remaining part (interference-free computation). Consequently, the communication of the finished part and the computation of the remaining part can be overlapped. On top of the signaling mechanism, FlashOverlap comprises two key components: (1) the determination of the signaling timing to boost the overlap efficiency (tile-wise overlapping), and (2) a pre-communication reordering to create the contiguous address for finished data, enabling communication by simply calling NCCL APIs (communication agnosticism), and a post-communication reordering to correct the data order. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases. Code is available at https://github.com/infinigence/FlashOverlap.
中文: FlashOverlap系统通过创新的信号触发机制,在保持计算性能的同时实现分块级计算与通信重叠,有效解决了多GPU系统中通信瓶颈问题,最高可获得1.65倍的加速效果。
English: The proposed FlashOverlap system overcomes inter-GPU communication bottlenecks in generative models by enabling interference-free, tile-wise overlapping of computation and communication through a novel signaling mechanism, achieving up to 1.65x speedup.

Authors:Xinyang Li, Chengjie Yi, Jiawei Lai, Mingbao Lin, Yansong Qu, Shengchuan Zhang, Liujuan Cao
Title: SynergyAmodal: Deocclude Anything with Text Control
Abstract:
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.
中文: SynergyAmodal框架通过整合真实图像、人类专业知识和生成先验,共同合成包含1.6万对样本的多样化数据,有效解决了图像去遮挡领域高质量数据稀缺的问题,实现了优异的零样本泛化能力和文本可控性。
English: The SynergyAmodal framework addresses the scarcity of high-quality amodal completion data by integrating in-the-wild images, human expertise, and generative priors to co-synthesize a diverse 16K-pair dataset, enabling effective zero-shot generalization and textual controllability.

Authors:Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Title: Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Abstract:
Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: https://github.com/Prisma-Multimodal/ViT-Prisma), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.
中文:Prisma是一个开源框架,通过提供统一的工具包来访问视觉变换器、训练工具、预训练权重和分析资源,加速视觉机制可解释性研究,同时揭示了视觉稀疏自编码器相比语言稀疏自编码器具有更低稀疏性等惊人发现。
English: Prisma is an open-source framework that accelerates vision mechanistic interpretability research by providing a unified toolkit for accessing vision transformers, training tools, pre-trained weights, and analytical resources, while revealing surprising findings such as lower sparsity in vision SAEs than in language SAEs.

Authors:Yejin Jeong, Donghun Lee
Title: CLIP-KOA: Enhancing Knee Osteoarthritis Diagnosis with Multi-Modal Learning and Symmetry-Aware Loss Functions
Abstract:
Knee osteoarthritis (KOA) is a chronic musculoskeletal disorder that is prevalent worldwide, making early diagnosis crucial. Currently, the Kellgren and Lawrence (KL) grading system is widely used to assess KOA severity. However, its high inter-observer variability and subjectivity hinder diagnostic consistency. To address these limitations, automated diagnostic techniques using deep learning have been actively explored in recent years. In this study, we propose a CLIP-based framework (CLIP-KOA) to enhance the consistency and reliability of KOA grade prediction. To achieve this, we introduce a learning approach that integrates image and text information and incorporates Symmetry Loss and Consistency Loss to ensure prediction consistency between the original and flipped images. CLIP-KOA achieves a state-of-the-art accuracy of 71.86% on the KOA severity prediction task, and ablation studies show that CLIP-KOA achieves a 2.36% improvement in accuracy over the standard CLIP model due to our contribution. This study shows a novel direction for data-driven medical prediction, both to improve the reliability of fine-grained diagnosis and to explore multimodal methods for medical image analysis. Our code is available at https://github.com/anonymized-link.
中文: 本研究提出的CLIP-KOA框架通过融合图像与文本信息并采用对称性损失和一致性损失,显著提升了膝骨关节炎严重程度预测的准确性和可靠性,达到了71.86%的最新准确率。
English: This study introduces CLIP-KOA, a deep learning framework that integrates image and text data with specialized loss functions to enhance the consistency and accuracy of knee osteoarthritis severity prediction, achieving state-of-the-art 71.86% accuracy.
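
The abstract states that the added losses enforce consistent predictions between an image and its flipped copy but does not give their form. As a hedged illustration only, one common way to penalise such disagreement is a symmetric KL divergence between the two predicted grade distributions; the function names and the choice of divergence are assumptions, not the paper's Symmetry Loss or Consistency Loss.

```python
import math


def _kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def flip_consistency_loss(p_original, p_flipped):
    """Symmetric KL between predictions for an image and its mirrored copy.

    Zero when both predictions agree; grows as they diverge.
    """
    return 0.5 * (_kl(p_original, p_flipped) + _kl(p_flipped, p_original))
```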

Authors:Dehao Yuan, Cornelia Fermüller
Title: A Real-Time Event-Based Normal Flow Estimator
Abstract:
This paper presents a real-time, asynchronous, event-based normal flow estimator. It follows the same algorithm as Learning Normal Flow Directly From Event Neighborhoods, but with a more optimized implementation. The original method treats event slices as 3D point clouds, encodes each event's local geometry into a fixed-length vector, and uses a multi-layer perceptron to predict normal flow. It constructs representations by multiplying an adjacency matrix with a feature matrix, resulting in quadratic time complexity with respect to the number of events. In contrast, we leverage the fact that event coordinates are integers and reformulate the representation step as a pooling operation. This achieves the same effect as the adjacency matrix but with much lower computational cost. As a result, our method supports real-time normal flow prediction on event cameras. Our estimator uses 1 GB of CUDA memory and runs at 4 million normal flows per second on an RTX 3070, or 6 million per second on an RTX A5000. We release the CUDA implementation along with a Python interface at https://github.com/dhyuan99/VecKM_flow_cpp.
中文: 本文提出了一种实时、异步、基于事件的法向流估计器,通过将表示步骤重新定义为池化操作,显著降低了计算复杂度,从而在事件相机上实现了高效的法向流预测。
English: This paper introduces a real-time, asynchronous, event-based normal flow estimator that optimizes the original method by reformulating the representation step as a pooling operation, reducing computational complexity and enabling efficient performance on event cameras.
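
The reformulation from an adjacency-matrix product to pooling can be illustrated on a simplified 2D case. The sketch below assumes scalar per-event features, a square neighbourhood, and no time axis, all of which are simplifications chosen for illustration; it is not the released CUDA implementation. Both functions compute the same neighbourhood sums, but the second avoids comparing every pair of events.

```python
from collections import defaultdict


def aggregate_adjacency(events, feats, radius):
    """Pairwise version: for each event, sum features of all events within
    `radius` pixels along both axes. Cost grows quadratically with len(events)."""
    out = []
    for xi, yi in events:
        total = 0.0
        for (xj, yj), f in zip(events, feats):
            if abs(xi - xj) <= radius and abs(yi - yj) <= radius:
                total += f
        out.append(total)
    return out


def aggregate_pooling(events, feats, radius):
    """Pooling version: scatter features onto the integer pixel grid, then
    sum a fixed (2 * radius + 1) ** 2 window around each event."""
    grid = defaultdict(float)
    for (x, y), f in zip(events, feats):
        grid[(x, y)] += f
    offsets = range(-radius, radius + 1)
    return [
        sum(grid.get((x + dx, y + dy), 0.0) for dx in offsets for dy in offsets)
        for x, y in events
    ]
```

The pooling version relies on event coordinates being integers, which is what allows events to be binned exactly onto grid cells.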

Authors:Mengxia Yu, Bang Nguyen, Olivia Zino, Meng Jiang
Title: Context Selection and Rewriting for Video-based Educational Question Generation
Abstract:
Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address these challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released at https://github.com/mengxiayu/COSER.
中文摘要:本研究提出了一种利用大型语言模型动态筛选和重写教育视频上下文的新框架,通过增强与时间戳和目标答案的匹配度,解决了生成准确相关教育问题的挑战。
English Summary: This study introduces a novel framework using large language models to dynamically select and rewrite contexts from educational videos, addressing challenges in generating accurate and relevant questions by improving alignment with timestamps and target answers.

Authors:Jiahao Lu, Chong Yin, Silvia Ingala, Kenny Erleben, Michael Bachmann Nielsen, Sune Darkner
Title: MERA: Multimodal and Multiscale Self-Explanatory Model with Considerably Reduced Annotation for Lung Nodule Diagnosis
Abstract:
Lung cancer, a leading cause of cancer-related deaths globally, emphasises the importance of early detection for better patient outcomes. Pulmonary nodules, often early indicators of lung cancer, necessitate accurate, timely diagnosis. Despite Explainable Artificial Intelligence (XAI) advances, many existing systems struggle to provide clear, comprehensive explanations, especially with limited labelled data. This study introduces MERA, a Multimodal and Multiscale self-Explanatory model designed for lung nodule diagnosis with considerably Reduced Annotation requirements. MERA integrates unsupervised and weakly supervised learning strategies (self-supervised learning techniques and Vision Transformer architecture for unsupervised feature extraction) and a hierarchical prediction mechanism leveraging sparse annotations via semi-supervised active learning in the learned latent space. MERA explains its decisions on multiple levels: model-level global explanations via semantic latent space clustering, instance-level case-based explanations showing similar instances, local visual explanations via attention maps, and concept explanations using critical nodule attributes. Evaluations on the public LIDC dataset show MERA's superior diagnostic accuracy and self-explainability. With only 1% annotated samples, MERA achieves diagnostic accuracy comparable to or exceeding state-of-the-art methods requiring full annotation. The model's inherent design delivers comprehensive, robust, multilevel explanations aligned closely with clinical practice, enhancing trustworthiness and transparency. Demonstrated viability of unsupervised and weakly supervised learning lowers the barrier to deploying diagnostic AI in broader medical domains. Our complete code is open source and available at https://github.com/diku-dk/credanno.
中文: 本研究提出MERA多模态自解释模型,通过结合无监督学习和分层解释机制,在仅需少量标注的情况下实现肺结节的高精度诊断,并提供符合临床实践的多层次解释。
English: This study introduces MERA, a multimodal self-explanatory model that achieves high diagnostic accuracy for lung nodules with minimal annotation by combining unsupervised learning and hierarchical explanations, while providing comprehensive clinical insights.

Authors:Pascal Roth, Jonas Frey, Cesar Cadena, Marco Hutter
Title: Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation
Abstract:
Ensuring safe navigation in complex environments requires accurate real-time traversability assessment and understanding of environmental interactions relative to the robot's capabilities. Traditional methods, which assume simplified dynamics, often require designing and tuning cost functions to safely guide paths or actions toward the goal. This process is tedious, environment-dependent, and not generalizable. To overcome these issues, we propose a novel learned perceptive Forward Dynamics Model (FDM) that predicts the robot's future state conditioned on the surrounding geometry and history of proprioceptive measurements, proposing a more scalable, safer, and heuristic-free solution. The FDM is trained on multiple years of simulated navigation experience, including high-risk maneuvers, and real-world interactions to incorporate the full system dynamics beyond rigid body simulation. We integrate our perceptive FDM into a zero-shot Model Predictive Path Integral (MPPI) planning framework, leveraging the learned mapping between actions, future states, and failure probability. This allows for optimizing a simplified cost function, eliminating the need for extensive cost-tuning to ensure safety. On the legged robot ANYmal, the proposed perceptive FDM improves the position estimation by 41% on average over competitive baselines, which translates into a 27% higher navigation success rate in rough simulation environments. Moreover, we demonstrate effective sim-to-real transfer and showcase the benefit of training on synthetic and real data. Code and models are made publicly available at https://github.com/leggedrobotics/fdm.
中文: 提出的感知前向动力学模型通过整合环境几何与本体感知历史来预测机器人未来状态,无需启发式调优即可实现更安全的导航,在复杂地形中成功率提升27%。
English: The proposed perceptive Forward Dynamics Model (FDM) predicts a robot's future state by integrating environmental geometry and proprioceptive history, enabling safer navigation without heuristic tuning and achieving a 27% higher success rate in rough terrain.
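Note: the MPPI planning step named in the abstract can be illustrated with the standard importance-weighted action update. This is a minimal sketch, not the authors' planner: the function name `mppi_update`, the `temperature` parameter, and the idea that each sample's cost already includes an FDM-predicted failure penalty are assumptions made for illustration.

```python
import math


def mppi_update(action_samples, costs, temperature=1.0):
    """Return the cost-weighted mean of sampled actions (standard MPPI weighting).

    action_samples: list of action vectors (lists of floats), one per rollout.
    costs: one scalar cost per rollout; in a setup like the one described above,
           this could combine goal distance with a predicted failure probability
           (an assumption, not taken from the source).
    """
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    dim = len(action_samples[0])
    return [
        sum(w * a[i] for w, a in zip(weights, action_samples)) / total
        for i in range(dim)
    ]


# Two candidate actions; the second rollout is heavily penalised.
samples = [[1.0, 0.0], [0.0, 1.0]]
safe_action = mppi_update(samples, costs=[0.0, 1000.0])
tied_action = mppi_update(samples, costs=[0.0, 0.0])
```

Lower-cost rollouts dominate the weighted mean, so a large failure penalty effectively removes unsafe samples from the update without hand-tuned terrain costs.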

Authors:Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Abstract:
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
中文: 针对现有基准忽略中文网络复杂性的问题,BrowseComp-ZH作为高难度中文网页评估基准被提出,大多数模型在其测试中表现不佳,凸显了当前模型在检索与推理能力上的不足。
English: To address the lack of benchmarks for evaluating LLM agents on the Chinese web, BrowseComp-ZH is introduced as a high-difficulty, multi-domain dataset where most models perform poorly, highlighting the challenges in retrieval and reasoning.

Authors:Ni Yao, Xiangyu Liu, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Weihua Zhou, Chen Zhao
Title: Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography
Abstract:
Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net, a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net. The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine. Our code is publicly available at http://github.com/chenzhao2023/MGFA_CCTA
Chinese: 本研究提出心肌区域引导特征聚合网络,通过整合解剖学先验知识和多尺度特征融合的双编码器架构,显著提升了冠状动脉分割和狭窄检测性能,实现了85.04%的Dice分数等优越指标。
English: This study introduces the Myocardial Region-guided Feature Aggregation Net, a dual-encoder architecture that enhances coronary artery segmentation and stenosis detection by integrating anatomical priors and multi-scale feature fusion, achieving superior performance metrics including an 85.04% Dice score.

Authors:Hanyu Lai, Junjie Gao, Xiao Liu, Yifan Xu, Shudan Zhang, Yuxiao Dong, Jie Tang
Title: AndroidGen: Building an Android Language Agent under Data Scarcity
Abstract:
Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.
中文: AndroidGen框架通过生成高质量数据轨迹来训练大型语言模型,解决了其在移动设备代理应用中数据稀缺和性能不足的问题,无需人工标注,并在多个基准测试中验证了其有效性。
English: The AndroidGen framework addresses the limitations of large language models in mobile agent applications by generating high-quality data trajectories for training, thereby improving performance without manual annotation, as validated across multiple benchmarks.

Authors:Shuhao Kang, Martin Y. Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers
Title: OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion
Abstract:
LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel framework for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components. First, a cross-modal visibility mask that identifies observable regions from both modalities to guide feature alignment. Second, an adaptive radial fusion module that dynamically consolidates radial features into discriminative global descriptors. Extensive experiments on KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher recall at 1m threshold for top-1 retrieved matches, along with 12x faster inference speed compared to the state-of-the-art approach. Code and data are publicly available at: https://github.com/kang-1-2-3/OPAL.
中文摘要:本文提出OPAL这一新型激光雷达地点识别框架,利用开放街道地图作为轻量级先验,通过跨模态可见性掩码和自适应径向融合模块弥合数据差异,在精度和推理速度上均显著超越现有最优方法。
English Summary: The paper introduces OPAL, a novel LiDAR place recognition framework that utilizes OpenStreetMap as a lightweight prior, overcoming domain gaps through cross-modal visibility masks and adaptive radial fusion to achieve superior accuracy and faster inference than existing methods.

Authors:Dylan Bouchard, Mohit Singh Chauhan
Title: Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Abstract:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
中文: 本文提出了一种无需外部资源的通用框架,通过将多种不确定性量化技术转化为标准化置信度分数,并采用可调整的集成方法,有效检测大语言模型的幻觉问题,其性能优于现有方法。
English: This paper introduces a versatile zero-resource framework for detecting hallucinations in Large Language Models by adapting uncertainty quantification techniques into standardized confidence scores and proposing a tunable ensemble approach that outperforms existing methods.
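Note: the tunable ensemble described in the abstract combines individual confidence scores, each in the range 0 to 1. The sketch below assumes the simplest such combination, a normalised weighted average. It does not use or reproduce the UQLM toolkit's API; `ensemble_confidence` and the scorer names are hypothetical.

```python
def ensemble_confidence(scores, weights):
    """Combine per-scorer confidence scores (each in [0, 1]) by weighted average.

    scores:  dict mapping scorer name -> confidence in [0, 1].
    weights: dict mapping the same scorer names -> non-negative tunable weight.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same scorers")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} is outside [0, 1]")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalising by the weight total keeps the ensemble score in [0, 1].
    return sum(weights[name] * scores[name] for name in scores) / total


# Hypothetical scorer outputs for one LLM response.
scores = {"black_box": 0.8, "white_box": 0.6, "llm_judge": 1.0}
weights = {"black_box": 2.0, "white_box": 1.0, "llm_judge": 1.0}
confidence = ensemble_confidence(scores, weights)
```

In practice the weights would be tuned on labelled examples from the target use case; that tuning step is omitted here.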

Authors:Loc Phuc Truong Nguyen, Hung Truong Thanh Nguyen, Hung Cao
Title: ODExAI: A Comprehensive Object Detection Explainable AI Evaluation
Abstract:
Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comprehensive framework designed to assess XAI methods in object detection based on three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity. We benchmark a set of XAI methods across two widely used object detectors (YOLOX and Faster R-CNN) and standard datasets (MS-COCO and PASCAL VOC). Empirical results demonstrate that region-based methods (e.g., D-CLOSE) achieve strong localization (PG = 88.49%) and high model faithfulness (OA = 0.863), though with substantial computational overhead (Time = 71.42s). On the other hand, CAM-based methods (e.g., G-CAME) achieve superior localization (PG = 96.13%) and significantly lower runtime (Time = 0.54s), but at the expense of reduced faithfulness (OA = 0.549). These findings demonstrate critical trade-offs among existing XAI approaches and reinforce the need for task-specific evaluation when deploying them in object detection pipelines. Our implementation and evaluation benchmarks are publicly available at: https://github.com/Analytics-Everywhere-Lab/odexai.
中文: ODExAI框架从定位准确性、模型忠实性和计算效率三个维度评估目标检测可解释AI方法,揭示了基于区域与基于CAM方法之间的性能权衡。
English: The ODExAI framework evaluates XAI methods for object detection across localization, faithfulness, and efficiency, revealing trade-offs between region-based and CAM-based approaches.

Authors:De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao
Title: Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID
Abstract:
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This insight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds optimization objectives for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. An optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance to state-of-the-art methods. Our code is available at https://github.com/FranklinLingfeng/code-for-SALCR.
中文:提出的SALCR框架通过语义对齐学习和协作优化模块,在无监督可见光-红外行人重识别中解决跨模态差异问题,实现细粒度模式优化的跨模态互补对齐。
English: The proposed SALCR framework addresses cross-modality variations in unsupervised visible-infrared person re-identification by introducing semantic-aligned learning and collaborative refinement modules to achieve complementary alignment between modalities through fine-grained pattern optimization.

Authors:Guoqing Hu, An Zhang, Shuo Liu, Zhibo Cai, Xun Yang, Xiang Wang
Title: AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings
Abstract:
Recent advancements in sequential recommendation have underscored the potential of Large Language Models (LLMs) for enhancing item embeddings. However, existing approaches face three key limitations: 1) the degradation of the semantic space when high-dimensional language embeddings are mapped to lower-dimensional ID embeddings, 2) the underutilization of language embeddings, and 3) the reliance on additional trainable parameters, such as an adapter, to bridge the gap between the semantic and behavior spaces. In this paper, we introduce AlphaFuse, a simple but effective language-guided learning strategy that addresses these challenges by learning ID embeddings within the null space of language embeddings. Specifically, we decompose the semantic space of language embeddings via Singular Value Decomposition (SVD), distinguishing it into a semantic-rich row space and a semantic-sparse null space. Collaborative signals are then injected into the null space, while preserving the rich semantics of the row space. AlphaFuse prevents degradation of the semantic space, integrates the retained language embeddings into the final item embeddings, and eliminates the need for auxiliary trainable modules, enabling seamless adaptation to any sequential recommendation framework. We validate the effectiveness and flexibility of AlphaFuse through extensive experiments on three benchmark datasets, including cold-start user and long-tail settings, showcasing significant improvements in both discriminative and diffusion-based generative sequential recommenders. Our codes and datasets are available at https://github.com/Hugo-Chinn/AlphaFuse.
中文: AlphaFuse是一种创新的语言引导学习策略,通过在语言嵌入的零空间中学习ID嵌入来增强序列推荐,既保持了语义完整性又无需额外可训练模块,并在多种设置下展现出显著的性能提升。
English: AlphaFuse is a novel language-guided learning strategy that enhances sequential recommendation by learning ID embeddings in the null space of language embeddings, preserving semantic integrity and eliminating the need for additional trainable modules while demonstrating significant performance improvements across various settings.
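Note: the core idea in the abstract (decompose language embeddings with SVD, keep the semantic-rich directions, and learn ID embeddings in the remaining low-information directions) can be sketched with NumPy. This is an illustration under stated assumptions, not the AlphaFuse implementation: the function name `fuse_in_null_space`, the mean-centering step, and treating the trailing `null_dim` singular directions as the null space are all assumptions.

```python
import numpy as np


def fuse_in_null_space(lang_emb, id_emb, null_dim):
    """Add ID embeddings only along the lowest-variance directions of lang_emb.

    lang_emb: (n_items, d) language embeddings.
    id_emb:   (n_items, null_dim) ID embeddings (trainable in a real model).
    null_dim: number of trailing singular directions treated as the null space.
    """
    centered = lang_emb - lang_emb.mean(axis=0, keepdims=True)
    # Rows of vt form an orthonormal basis of the d-dimensional embedding
    # space, ordered from largest to smallest singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=True)
    coords = centered @ vt.T  # item coordinates in the rotated basis
    # Leading columns keep the language semantics untouched; collaborative
    # (ID) signal is injected only into the trailing columns.
    coords[:, -null_dim:] += id_emb
    return coords


rng = np.random.default_rng(0)
lang = rng.normal(size=(6, 4))  # toy language embeddings
ids = rng.normal(size=(6, 2))   # toy ID embeddings
fused = fuse_in_null_space(lang, ids, null_dim=2)
```

Because the ID signal occupies directions the language embeddings barely use, no adapter or projection layer is needed to merge the two, which matches the motivation given in the abstract.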

Authors:Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He
Title: FlexPara: Flexible Neural Surface Parameterization
Abstract:
Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara
中文摘要:FlexPara是一种无监督神经优化框架,通过建立三维表面点与自适应变形二维UV坐标之间的映射,实现了无需人工切割缝的灵活全局和多图表表面参数化。
English Summary: FlexPara is an unsupervised neural framework that enables flexible global and multi-chart surface parameterization through adaptive point-wise mappings between 3D surfaces and 2D UV coordinates, eliminating the need for manual cutting seams.

Authors:Jianlong Chen, Chao Li, Yang Yuan, Andrew C Yao
Title: Hierarchical Attention Generates Better Proofs
Abstract:
Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce Hierarchical Attention, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while reducing proof complexity by 23.81% and 16.50% respectively. The code is available at https://github.com/Car-pe/HAGBP.
中文: 分层注意力是一种正则化方法,通过将大语言模型的注意力机制与数学推理结构对齐,提高了定理证明的成功率并降低了证明复杂度。
English: Hierarchical Attention is a regularization method that aligns LLMs' attention with mathematical reasoning structures, improving proof success rates and reducing complexity in theorem proving.

Authors:Zhangshuo Qi, Luqi Cheng, Zijie Zhou, Guangming Xiong
Title: LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition
Abstract:
In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird's eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at https://github.com/QiZS-BIT/LRFusionPR.
中文: LRFusionPR通过双分支网络在统一鸟瞰图表示中融合激光雷达与雷达数据,提升了自动驾驶场景识别的精度和恶劣天气下的鲁棒性。
English: LRFusionPR enhances autonomous driving place recognition by fusing LiDAR and radar data through a dual-branch network in a unified BEV representation, achieving improved accuracy and weather robustness.

Authors:Zhikai Wang, Yanyan Shen, Zexi Zhang, Li He, Yichun Li, Hao Gu, Yinghua Zhang
Title: Relative Contrastive Learning for Sequential Recommendation with Similarity-based Positive Pair Selection
Abstract:
Contrastive Learning (CL) enhances the training of sequential recommendation (SR) models through informative self-supervision signals. Existing methods often rely on data augmentation strategies to create positive samples and promote representation invariance. Some strategies such as item reordering and item substitution may inadvertently alter user intent. Supervised Contrastive Learning (SCL) based methods offer an alternative to augmentation-based CL methods by selecting same-target sequences (interaction sequences with the same target item) to form positive samples. However, SCL-based methods suffer from the scarcity of same-target sequences and consequently lack enough signals for contrastive learning. In this work, we propose to use similar sequences (with different target items) as additional positive samples and introduce a Relative Contrastive Learning (RCL) framework for sequential recommendation. RCL comprises a dual-tiered positive sample selection module and a relative contrastive learning module. The former module selects same-target sequences as strong positive samples and selects similar sequences as weak positive samples. The latter module employs a weighted relative contrastive loss, ensuring that each sequence is represented closer to its strong positive samples than its weak positive samples. We apply RCL on two mainstream deep learning-based SR models, and our empirical results reveal that RCL achieves an average improvement of 4.88% over state-of-the-art SR methods on five public datasets and one private dataset.
Chinese: 本文提出了一种用于序列推荐的相对对比学习(RCL)框架,通过将同目标序列作为强正样本和相似序列作为弱正样本来提升模型性能,在多个数据集上平均比现有最优方法提高了4.88%。
English: This paper introduces a Relative Contrastive Learning (RCL) framework for sequential recommendation, which uses same-target sequences as strong positives and similar sequences as weak positives to enhance model performance, achieving an average 4.88% improvement over state-of-the-art methods.
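Note: the abstract states that each sequence should be represented closer to its strong positives than to its weak positives. The abstract does not give the weighted relative contrastive loss itself, so the sketch below substitutes a plain hinge-style ranking term that enforces the same ordering. The function names and the `margin` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_margin_loss(anchor, strong, weak, margin=0.2):
    """Penalise cases where a weak positive is not at least `margin`
    less similar to the anchor than a strong positive.

    anchor: representation of the sequence being trained.
    strong: representations of same-target sequences (strong positives).
    weak:   representations of similar sequences (weak positives).
    """
    total, pairs = 0.0, 0
    for s in strong:
        for w in weak:
            gap = cosine(anchor, s) - cosine(anchor, w)
            total += max(0.0, margin - gap)
            pairs += 1
    return total / pairs if pairs else 0.0


anchor = [1.0, 0.0]
ordered = relative_margin_loss(anchor, strong=[[1.0, 0.0]], weak=[[0.0, 1.0]])
violated = relative_margin_loss(anchor, strong=[[0.0, 1.0]], weak=[[1.0, 0.0]])
```

The loss is zero when the strong positive is already closer than the weak one by the margin, and grows when that ordering is violated.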

Authors:Piotr Migus
Title: Newton-Puiseux Analysis for Interpretability and Calibration of Complex-Valued Neural Networks
Abstract:
Complex-valued neural networks (CVNNs) excel where phase matters, yet their multi-sheeted decision surfaces defy standard explainability and calibration tools. We propose a Newton-Puiseux framework that fits a local polynomial surrogate to a high-uncertainty input and analytically decomposes this surrogate into fractional-power series. The resulting Puiseux expansions, dominant Puiseux coefficients, and phase-aligned curvature descriptors deliver closed-form estimates of robustness and over-confidence that gradient- or perturbation-based methods (saliency, LIME, SHAP) cannot provide. On a controlled $\mathbb{C}^2$ helix the surrogate attains RMSE $< 0.09$ while recovering the number of decision sheets; quartic coefficients predict adversarial flip radii within $10^{-3}$. On the real-world MIT-BIH arrhythmia corpus, Puiseux-guided, phase-aware temperature scaling lowers expected calibration error from 0.087 to 0.034, contributing to the advancement of CVNNs. Full code, pre-trained weights, and scripts are at https://github.com/piotrmgs/puiseux-cvnn.
Chinese: 提出的牛顿-普伊瑟框架通过将局部多项式代理分解为分数幂级数,能够对复数神经网络的鲁棒性和过度自信进行闭式估计,在合成和真实数据集上均实现了更好的校准效果和预测精度。
English: The proposed Newton-Puiseux framework enables closed-form estimation of robustness and over-confidence in complex-valued neural networks by decomposing local polynomial surrogates into fractional-power series, achieving improved calibration and predictive accuracy on both synthetic and real-world datasets.
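As a rough illustration of the calibration quantities named in this abstract, the sketch below computes expected calibration error (ECE) with equal-width confidence bins and applies plain scalar temperature scaling to real-valued logits. It does not reproduce the paper's Puiseux-guided, phase-aware procedure; the function names, the bin count, and the use of real logits are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over real logits after dividing by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

def ece_for_temperature(logit_rows, labels, temperature):
    """ECE of the top-1 predictions after temperature scaling (hypothetical helper)."""
    confs, hits = [], []
    for logits, label in zip(logit_rows, labels):
        probs = softmax(logits, temperature)
        pred = max(range(len(probs)), key=probs.__getitem__)
        confs.append(probs[pred])
        hits.append(pred == label)
    return expected_calibration_error(confs, hits)
```

In this simplified setting, a temperature above 1 flattens the predicted distribution, which lowers ECE when the model is over-confident.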

Authors:Xin Li, Kaikai Jia, Hao Sun, Jun Dai, Ziyang Jiang
Title: Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget
Abstract:
Recent advancements in text-to-speech (TTS) models have been driven by the integration of large language models (LLMs), enhancing semantic comprehension and improving speech naturalness. However, existing LLM-based TTS models often lack open-source training code and efficient inference acceleration frameworks, limiting their accessibility and adaptability. Additionally, there is no publicly available TTS model specifically optimized for podcast scenarios, which are in high demand for voice interaction applications. To address these limitations, we introduce Muyan-TTS, an open-source trainable TTS model designed for podcast applications within a $50,000 budget. Our model is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices. In addition to open-sourcing the model, we provide a comprehensive data collection and processing pipeline, a full training procedure, and an optimized inference framework that accelerates LLM-based TTS synthesis. Our code and models are available at https://github.com/MYZY-AI/Muyan-TTS.
中文:Muyan-TTS是一款专为播客场景优化的开源可训练文本转语音模型,具备高质量零样本合成和说话人自适应功能,并提供完整的训练流程与加速推理框架。
English: Muyan-TTS is an open-source, trainable text-to-speech model optimized for podcast scenarios, featuring high-quality zero-shot synthesis and speaker adaptation while providing complete training and accelerated inference frameworks.

Authors:Bowei Wang, Jiaran Gao, Yelai Feng, Renzhi Chen, Shanshan Li, Lei Wang
Title: ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development
Abstract:
The growing demand for Domain-Specific Architecture (DSA) has driven the development of Agile Hardware Development Methodology (AHDM). Hardware Construction Language (HCL) like Chisel offers high-level abstraction features, making it an ideal language for HCL-Based AHDM. While Large Language Models (LLMs) excel in code generation tasks, they still face challenges with Chisel generation, particularly regarding syntax correctness and design variability. Recent reasoning models have significantly enhanced code generation capabilities through test-time scaling techniques. However, we found that reasoning models without domain adaptation cannot bring substantial benefits to Chisel code generation tasks. This paper presents ChiseLLM, a solution comprising data processing and transformation, prompt-guided reasoning trace synthesis, and domain-adapted model training. We constructed high-quality datasets from public RTL code resources and guided the model to adopt structured thinking patterns through prompt enhancement methods. Experiments demonstrate that our ChiseLLM-7B and ChiseLLM-32B models improved syntax correctness by 18.85% and 26.32% respectively over base models, while increasing variability design ability by 47.58% compared to baseline reasoning models. Our datasets and models are publicly available, providing high-performance, cost-effective models for HCL-Based AHDM, and offering an effective baseline for future research. Github repository: https://github.com/observerw/ChiseLLM
Chinese: 本文提出的ChiseLLM通过领域自适应训练和结构化思维引导,显著提升了Chisel代码生成的语法正确性和设计多样性,为基于硬件构造语言的敏捷开发提供了高效解决方案。
English: This paper introduces ChiseLLM, a domain-adapted framework that significantly enhances syntax correctness and design variability in Chisel code generation through specialized data processing and prompt-guided reasoning.

Authors:Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li
Title: PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification
Abstract:
The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and underutilized spectral complementarity. Existing methods often fail to decouple shared structural features from modality-complementary radiometric attributes, causing feature conflicts and information loss. To address this, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-complementary) components in the Fourier domain, thus reinforcing shared structures while preserving complementary characteristics to improve fusion quality. Unlike prior approaches that overlook the distinct physical properties encoded in frequency spectra, PAD is the first to introduce explicit amplitude-phase decoupling for multi-modal fusion. Specifically, PAD comprises two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features via convolution-guided scaling to enhance geometric consistency; and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high-frequency and low-frequency patterns using frequency-adaptive multilayer perceptrons, leveraging SAR's morphological sensitivity and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance. Our work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
中文: 本文提出相位-振幅解耦(PAD)框架,通过在傅里叶域分离相位和振幅分量,强化共享结构并保留互补特征,有效提升了SAR与RGB影像在多模态融合中的土地覆盖分类性能。
English: This paper introduces Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase and amplitude components in the Fourier domain to enhance multi-modal fusion of SAR and RGB imagery for land cover classification by reinforcing shared structures while preserving complementary characteristics.
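The basic idea of separating a Fourier spectrum into amplitude and phase can be sketched on a 1D signal with the standard library alone. This is only a toy under stated assumptions: PAD works on 2D features with learned PSC and ASF modules, whereas here `fuse` simply averages two amplitude spectra and keeps the phase of the first signal. All function names and the `weight` parameter are hypothetical.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def decouple(signal):
    """Split the spectrum into amplitude and phase components."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase

def recombine(amplitude, phase):
    """Rebuild a real signal from amplitude and phase (imaginary residue dropped)."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [x.real for x in idft(spectrum)]

def fuse(signal_a, signal_b, weight=0.5):
    """Blend the two amplitude spectra while keeping the phase of signal_a."""
    amp_a, phase_a = decouple(signal_a)
    amp_b, _ = decouple(signal_b)
    mixed = [weight * a + (1 - weight) * b for a, b in zip(amp_a, amp_b)]
    return recombine(mixed, phase_a)
```

Both inputs are assumed to have the same length. Keeping one phase spectrum fixed preserves where structures sit in the signal, while the blended amplitude changes how strongly each frequency contributes.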

Authors:Jialang Lu, Huayu Zhao, Huiyu Zhai, Xingxing Yang, Shini Han
Title: DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning
Abstract:
There has long been a belief that high-level semantics learning can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brute-force mapping between low-light and normal-light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both image-level semantic prior and text-level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image-level semantic prior guidance by leveraging hierarchical semantic features from a pre-trained semantic segmentation model; a text-level semantic prior guidance by integrating natural language semantic constraints via a pre-trained vision-language model; a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Finally, our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at https://github.com/Wenyuzhy/DeepSPG.
中文: 提出的DeepSPG框架通过整合图像分割和自然语言的多模态语义先验来增强低光图像,在多个基准测试中超越了现有方法。
English: The proposed DeepSPG framework enhances low-light images by integrating multimodal semantic priors from both image segmentation and natural language, outperforming existing methods across multiple benchmarks.

Authors:Jikai Wang, Juntao Li, Jianye Hou, Bowen Yan, Lijun Wu, Min Zhang
Title: Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Abstract:
Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating average reasoning speed through large and small model collaboration. SCoT conducts thought-level drafting using a lightweight draft model. Then it selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting and the draft selection strategy maintains the prediction accuracy of the target model for complex tasks. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48\%$\sim$66\% and 21\%$\sim$49\% for Deepseek-R1-Distill-Qwen-32B and Deepseek-R1-Distill-Llama-70B while achieving near-target-model-level performance. Our code is available at https://github.com/Jikai0Wang/Speculative_CoT.
中文: 本文提出的推测性思维链(SCoT)方法通过大小模型协作加速平均推理速度,在保持接近目标模型性能的同时,将推理延迟降低了21%至66%。
English: The paper introduces Speculative Chain-of-Thought (SCoT), a method that reduces reasoning latency by accelerating average reasoning speed through collaboration between large and small models, achieving near-target-model performance while cutting latency by 21% to 66% across various benchmarks.
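The draft, select, and correct flow described in this abstract can be sketched as a small control loop. This is a schematic under stated assumptions, not the authors' implementation: `draft_fn`, `select_fn`, and `target_fn` are hypothetical callables standing in for the draft model, the target model's selection step, and the target model's own reasoning.

```python
def speculative_cot(question, draft_fn, select_fn, target_fn, n_drafts=3):
    """Return (reasoning, source), where source is "draft" or "target".

    draft_fn(question, i)       -> i-th cheap chain-of-thought draft (hypothetical)
    select_fn(question, drafts) -> index of the accepted draft, or None if all
                                   drafts are judged wrong (hypothetical)
    target_fn(question)         -> full reasoning from the large model (hypothetical)
    """
    # Step 1: the lightweight model produces several thought-level drafts.
    drafts = [draft_fn(question, i) for i in range(n_drafts)]

    # Step 2: the target model only has to pick a draft, which is cheaper
    # than generating a long chain of thought itself.
    choice = select_fn(question, drafts)

    # Step 3: fall back to the target model when no draft is acceptable.
    if choice is None:
        return target_fn(question), "target"
    return drafts[choice], "draft"
```

Latency savings in such a scheme come from the cases where a draft is accepted, since the expensive `target_fn` call is skipped entirely.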

Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao
Title: Versatile Framework for Song Generation with Prompt-based Control
Abstract:
Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
Chinese: VersBand 是一个多任务歌曲生成框架,通过专门模型实现基于提示的高质量对齐歌曲合成,在人声、伴奏、歌词和旋律方面均优于基线模型。
English: VersBand is a multi-task song generation framework that synthesizes high-quality, aligned songs with prompt-based control, outperforming baselines across various tasks through its specialized models for vocals, accompaniments, lyrics, and melodies.

Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Title: Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Abstract:
Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
Chinese: 生成式AI通过整合面部、手势和动作合成等领域的进展,正在革新角色动画技术,本综述为研究人员提供了全面指导,涵盖当前技术与未来发展方向。
English: Generative AI is revolutionizing character animation by integrating advancements in facial, gesture, and motion synthesis, offering a comprehensive review to guide researchers through current technologies and future directions.

Authors:Di Wu, Yibin Lei, Christof Monz
Title: Calibrating Translation Decoding with Quality Estimation on LLMs
Abstract:
Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-llm-mt.
中文: 本文提出一种通过优化假设似然与翻译质量相关性进行校准的方法,仅需少量训练即可大幅提升大语言模型的翻译性能,同时提高解码效率并可直接作为翻译质量评估指标。
English: This paper introduces a calibration method that optimizes the correlation between hypothesis likelihoods and translation quality, significantly improving neural machine translation performance in large language models with minimal training and enhancing decoding efficiency.
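The quantity this abstract optimizes, the Pearson correlation between hypothesis likelihoods and quality scores, is easy to state in code. The sketch below computes it for one source sentence's set of hypotheses. Turning it into a loss as `1 - r` is an assumption made here for illustration; the abstract does not specify the exact objective, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)

def calibration_loss(log_likelihoods, quality_scores):
    """Assumed loss form: 0 when likelihoods rank hypotheses exactly like quality."""
    return 1.0 - pearson(log_likelihoods, quality_scores)
```

A well-calibrated model in this sense assigns higher likelihood to hypotheses that a quality metric also scores higher, which is why the calibrated likelihood can then double as a quality estimate.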

Authors:Justin Mücke, Ansgar Scherp
Title: GLaMoR: Consistency Checking of OWL Ontologies using Graph Language Models
Abstract:
Semantic reasoning aims to infer new knowledge from existing knowledge, with OWL ontologies serving as a standardized framework for organizing information. A key challenge in semantic reasoning is verifying ontology consistency. However, state-of-the-art reasoners are computationally expensive, and their efficiency decreases as ontology sizes grow. While classical machine learning models have been explored for consistency checking, they struggle to capture complex relationships within ontologies. Large language models (LLMs) have shown promising results for simple reasoning tasks but perform poorly on structured reasoning. The recently introduced Graph Language Model (GLM) offers a way to simultaneously process graph-structured data and text. This paper proposes GLaMoR (Graph Language Model for Reasoning), a reasoning pipeline that transforms OWL ontologies into graph-structured data and adapts the GLM architecture for consistency checking. We evaluate GLaMoR on ontologies from the NCBO BioPortal repository, converting them into triples suitable for model input. Our results show that the GLM outperforms all baseline models, achieving $95\%$ accuracy while being 20 times faster than classical reasoners. The code is available at: https://github.com/JustinMuecke/GLaMoR
Chinese Summary: 本文提出GLaMoR推理框架,通过将OWL本体转换为图结构数据并采用图语言模型进行一致性检测,在保持95%准确率的同时,比传统推理器提速20倍。
English Summary: This paper introduces GLaMoR, a reasoning pipeline that converts OWL ontologies into graph-structured data and utilizes a Graph Language Model for efficient consistency checking, achieving 95% accuracy and 20x faster performance compared to traditional reasoners.

Authors:Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody
Title: Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
Abstract:
The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: https://github.com/GoAT-pydev/Graph_of_Attacks.
中文: GoAT是一种基于图推理框架的黑盒方法,能高效生成可读性强且有效的对抗性提示来测试大语言模型的对齐性,相比现有最优攻击方法,它以更少查询次数实现了显著更高的越狱成功率。
English: GoAT is a black-box method that uses a graph-based reasoning framework to efficiently generate effective, human-readable adversarial prompts for testing LLM alignment, achieving significantly higher jailbreak success rates with fewer queries than current state-of-the-art attacks.

Authors:Gal Almog, Ariel Shamir, Ohad Fried
Title: REED-VAE: RE-Encode Decode Training for Iterative Image Editing with Diffusion Models
Abstract:
While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE
中文: REED训练方案通过优化变分自编码器,实现了高质量的多方法迭代图像编辑,即使在多次操作后仍能保持图像质量,支持扩散模型与传统编辑技术的灵活结合。
English: The REED training scheme for VAEs enables high-quality iterative image editing by preserving image integrity across multiple operations, supporting both diffusion-based and conventional techniques without accumulating artifacts.

Authors:Xuyin Qi, Zeyu Zhang, Canxuan Gang, Hao Zhang, Lei Zhang, Zhiwei Zhang, Yang Zhao
Title: MediAug: Exploring Visual Augmentation in Medical Imaging
Abstract:
Data augmentation is essential in medical imaging for improving classification accuracy, lesion detection, and organ segmentation under limited data conditions. However, two significant challenges remain. First, a pronounced domain gap between natural photographs and medical images can distort critical disease features. Second, augmentation studies in medical imaging are fragmented and limited to single tasks or architectures, leaving the benefits of advanced mix-based strategies unclear. To address these challenges, we propose a unified evaluation framework with six mix-based augmentation methods integrated with both convolutional and transformer backbones on brain tumour MRI and eye disease fundus datasets. Our contributions are threefold. (1) We introduce MediAug, a comprehensive and reproducible benchmark for advanced data augmentation in medical imaging. (2) We systematically evaluate MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix with ResNet-50 and ViT-B backbones. (3) We demonstrate through extensive experiments that MixUp yields the greatest improvement on the brain tumor classification task for ResNet-50 with 79.19% accuracy and SnapMix yields the greatest improvement for ViT-B with 99.44% accuracy, and that YOCO yields the greatest improvement on the eye disease classification task for ResNet-50 with 91.60% accuracy and CutMix yields the greatest improvement for ViT-B with 97.94% accuracy. Code will be available at https://github.com/AIGeeksGroup/MediAug.
Chinese: 本研究提出了MediAug这一统一基准,用于评估六种混合数据增强方法在医学影像任务中的表现,结果表明MixUp和SnapMix在脑肿瘤分类中准确率最高,而YOCO和CutMix在不同神经网络架构的眼疾分类任务中表现最优。
English: This study introduces MediAug, a unified benchmark for evaluating six mix-based data augmentation methods on medical imaging tasks, demonstrating that MixUp and SnapMix achieve the highest accuracy for brain tumor classification, while YOCO and CutMix excel in eye disease classification across different neural network architectures.
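Of the six methods benchmarked, MixUp is the simplest to show: it forms a convex combination of two inputs and of their one-hot labels, with the mixing coefficient drawn from a Beta(alpha, alpha) distribution. The sketch below works on flat lists for brevity. The `alpha` default and the function signature are assumptions, not the settings used in MediAug.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Mix two (input, one-hot label) pairs with lam ~ Beta(alpha, alpha).

    Inputs and labels are flat lists of floats; real pipelines would apply
    the same arithmetic to image tensors and batched label matrices.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mixed, y_mixed, lam
```

With a small `alpha`, most draws of `lam` land near 0 or 1, so the mixed sample usually stays close to one of the two originals.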

Authors:Junjie Zhou
Title: Feature Fusion Revisited: Multimodal CTR Prediction for MMCTR Challenge
Abstract:
With the rapid advancement of Multimodal Large Language Models (MLLMs), an increasing number of researchers are exploring their application in recommendation systems. However, the high latency associated with large models presents a significant challenge for such use cases. The EReL@MIR workshop provided a valuable opportunity to experiment with various approaches aimed at improving the efficiency of multimodal representation learning for information retrieval tasks. As part of the competition's requirements, participants were mandated to submit a technical report detailing their methodologies and findings. Our team was honored to receive the award for Task 2 - Winner (Multimodal CTR Prediction). In this technical report, we present our methods and key findings. Additionally, we propose several directions for future work, particularly focusing on how to effectively integrate recommendation signals into multimodal representations. The codebase for our implementation is publicly available at: https://github.com/Lattice-zjj/MMCTR_Code, and the trained model weights can be accessed at: https://huggingface.co/FireFlyCourageous/MMCTR_DIN_MicroLens_1M_x1.
Chinese: 本报告阐述了在EReL@MIR研讨会任务二中获奖的多模态CTR预测方法与关键发现,提出了将推荐信号融入多模态表征的未来研究方向,并公开了相关代码和模型资源。
English: This report details the award-winning methods and findings from the EReL@MIR workshop's Task 2 on multimodal CTR prediction, proposing future directions for integrating recommendation signals into multimodal representations and making the code and model publicly available.

Authors:Ali Nazari, Mohsen Ebrahimi Moghaddam, Omidreza Borzoei
Title: Kinship Verification through a Forest Neural Network
Abstract:
Early methods used face representations in kinship verification, which are less accurate than joint representations of parents' and children's facial images learned from scratch. We propose an approach featuring graph neural network concepts to utilize face representations and have comparable results to joint representation algorithms. Moreover, we designed the structure of the classification module and introduced a new combination of losses to engage the center loss gradually in training our network. Additionally, we conducted experiments on KinFaceW-I and II, demonstrating the effectiveness of our approach. We achieved the best result on KinFaceW-II, an average improvement of nearly 1.6 for all kinship types, and we were near the best on KinFaceW-I. The code is available at https://github.com/ali-nazari/Kinship-Verification
Chinese: 我们采用图神经网络优化面部表征进行亲属关系验证,其效果与联合表征算法相当,并在KinFaceW数据集上实现了精度提升,创下新记录。
English: Our method employs graph neural networks to enhance face representations for kinship verification, achieving results comparable to joint representation algorithms and setting new benchmarks on KinFaceW datasets with improved accuracy.

Authors:Zhongpu Chen, Wanjun Hao, Ziang Zeng, Long Shi, Yi Wen, Zhi-Jie Wang, Yu Zhao
Title: LiLIS: Enhancing Big Spatial Data Processing with Lightweight Distributed Learned Index
Abstract:
The efficient management of big spatial data is crucial for location-based services, particularly in smart cities. However, existing systems such as Simba and Sedona, which incorporate distributed spatial indexing, still incur substantial index construction overheads, rendering them far from optimal for real-time analytics. Recent studies demonstrate that learned indices can achieve high efficiency through well-designed machine learning models, but how to design a learned index for distributed spatial analytics remains unaddressed. In this paper, we present LiLIS, a Lightweight Distributed Learned Index for big spatial data. LiLIS combines machine-learned search strategies with spatial-aware partitioning within a distributed framework, and efficiently implements common spatial queries, including point query, range query, k-nearest neighbors (kNN), and spatial joins. Extensive experimental results over real-world and synthetic datasets show that LiLIS outperforms state-of-the-art big spatial data analytics systems by 2 to 3 orders of magnitude for most spatial queries, and the index building achieves a $1.5$-$2\times$ speed-up. The code is available at https://github.com/SWUFE-DB-Group/learned-index-spark.
Chinese: LiLIS提出了一种轻量级分布式学习索引,通过结合机器学习与空间分区技术,大幅提升了大规模空间数据查询性能并加快了索引构建速度。
English: LiLIS introduces a lightweight distributed learned index that integrates machine learning with spatial partitioning to significantly enhance query performance and accelerate index construction for big spatial data analytics.
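The general idea of a learned index can be sketched in one dimension: fit a model that predicts a key's position in a sorted array, record the worst prediction error, and finish with a binary search inside that error window. This is a generic single-node illustration, not LiLIS's distributed spatial design; the class name and the choice of a least-squares line are assumptions.

```python
import bisect

class LinearLearnedIndex:
    """Sorted keys plus a linear position model with a bounded search window."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_err = max(abs(i - self._predict(k)) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The better the model fits the key distribution, the smaller `max_err` becomes and the fewer comparisons each lookup needs.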

Authors:Robert Leppich, Michael Stenger, Daniel Grillmeyer, Vanessa Borst, Samuel Kounev
Title: TSRM: A Lightweight Temporal Feature Encoding Architecture for Time Series Forecasting and Imputation
Abstract:
We introduce a temporal feature encoding architecture called Time Series Representation Model (TSRM) for multivariate time series forecasting and imputation. The architecture is structured around CNN-based representation layers, each dedicated to an independent representation learning task and designed to capture diverse temporal patterns, followed by an attention-based feature extraction layer and a merge layer, designed to aggregate extracted features. The architecture is fundamentally based on a configuration that is inspired by a Transformer encoder, with self-attention mechanisms at its core. The TSRM architecture outperforms state-of-the-art approaches on most of the seven established benchmark datasets considered in our empirical evaluation for both forecasting and imputation tasks. At the same time, it significantly reduces complexity in the form of learnable parameters. The source code is available at https://github.com/RobertLeppich/TSRM.
中文摘要:时间序列表示模型(TSRM)采用基于CNN和注意力机制的架构,在多元时间序列预测和填补任务中优于现有最优方法,同时显著降低了模型复杂度。
English Summary: The Time Series Representation Model (TSRM) introduces a CNN-based architecture with attention mechanisms for multivariate time series forecasting and imputation, outperforming state-of-the-art methods while reducing complexity.

Authors:Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood
Title: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Abstract:
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Supported by novel loss functions, our model captures a broader range of information, enriching feature representation, improving discriminative ability, and enhancing generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP
中文摘要:本研究提出了一种多分辨率视觉语言模型,通过在不同放大倍数下对齐视觉与文本数据,显著提升了计算病理学中的癌症亚型分类和组织分析性能,并在多项任务中超越了现有先进方法。
English Summary: This study introduces a multi-resolution vision-language model for computational pathology that enhances cancer subtype classification and tissue analysis by aligning visual and textual data across different magnifications, outperforming existing methods on multiple tasks.

Authors:Hayley Ross, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara
Title: When2Call: When (not) to Call Tools
Abstract:
Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
Chinese: When2Call基准测试评估语言模型在何时使用工具方面的决策能力,揭示了当前模型的显著不足,并提出了优于传统微调的训练方法。
English: The When2Call benchmark evaluates language models' decision-making on when to use tools, revealing significant gaps in current models and introducing a training method that outperforms traditional fine-tuning.
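
The abstract notes that the multiple-choice nature of the benchmark is used to build a preference optimization regime. A minimal sketch of how one multiple-choice item could be turned into (chosen, rejected) pairs follows; the field names, option labels, and example item are hypothetical and are not taken from the released data.

```python
from typing import Dict, List


def build_preference_pairs(item: Dict) -> List[Dict[str, str]]:
    """Pair the correct option with each incorrect option (hypothetical schema)."""
    chosen = item["options"][item["correct"]]
    pairs = []
    for label, text in item["options"].items():
        if label == item["correct"]:
            continue  # the correct option is never used as a rejected response
        pairs.append({"prompt": item["question"], "chosen": chosen, "rejected": text})
    return pairs


# Illustrative item; the labels mirror the decision types named in the abstract.
example = {
    "question": "What is the weather in Paris tomorrow?",
    "options": {
        "tool_call": 'get_weather(city="Paris", day="tomorrow")',
        "follow_up": "Which city do you mean?",
        "unable": "I can't answer that with the tools provided.",
    },
    "correct": "tool_call",
}

pairs = build_preference_pairs(example)
```

Each multiple-choice item with n options yields n - 1 pairs under this scheme, which could then be passed to any preference-based trainer.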

Authors:Hang Yu, Jiahao Wen, Zhedong Zheng
Title: CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
Abstract:
Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history for errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.
中文: 本文提出了一种基于跨模态自适应元学习的领域无关预训练框架CAMeL,通过动态误差样本记忆单元和自适应双速更新策略增强模型泛化能力,在真实场景基准测试中超越现有最优方法,并展现出对偏差合成数据和噪声文本标注的强鲁棒性。
English: This paper introduces a domain-agnostic pretraining framework called CAMeL that enhances model generalization through cross-modality adaptive meta-learning, outperforming state-of-the-art methods on real-world benchmarks while demonstrating robustness to biased synthetic data and noisy text annotations.

Authors:Jianyou Wang, Weili Cao, Kaicheng Wang, Xiaoyue Wang, Ashish Dalvi, Gino Prasad, Qishan Liang, Hsuan-lin Her, Ming Wang, Qin Yang, Gene W. Yeo, David E. Neal, Maxim Khan, Christopher D. Rosin, Ramamohan Paturi, Leon Bergen
Title: EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers
Abstract:
We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task, which is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
中文: 本研究提出了EvidenceBench基准,通过基于专家判断生成假设并标注论文的流程,评估模型在识别生物医学假设相关证据方面的表现,发现现有模型性能仍远低于专家水平。
English: This research introduces EvidenceBench, a benchmark for evaluating how well models identify evidence relevant to biomedical hypotheses, created through a pipeline that generates hypotheses and annotates papers based on expert judgments, and finds current models still lag significantly behind human expert performance.

Authors:Tung D. Vu, Chung Hoang, Truong-Son Hy
Title: Multimodal graph representation learning for website generation based on visual sketch
Abstract:
The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle with accurately interpreting the intricate visual details and structural relationships inherent in webpage designs, leading to limitations in automation and efficiency. In this paper, we propose a novel method that leverages multimodal graph representation learning to address these challenges. By integrating both visual and structural information from design sketches, our approach enhances the accuracy and efficiency of code generation, particularly in producing semantically correct and structurally sound HTML code. We present a comprehensive evaluation of our method, demonstrating significant improvements in both accuracy and efficiency compared to existing techniques. Extensive evaluation demonstrates significant improvements of multimodal graph learning over existing techniques, highlighting the potential of our method to revolutionize design-to-code automation. Code available at https://github.com/HySonLab/Design2Code
中文摘要:本文提出一种多模态图表示学习方法,通过整合视觉和结构信息将设计稿精确转换为HTML代码,相比现有技术展现出显著优越的性能。
English Summary: This paper introduces a multimodal graph representation learning method that effectively converts digital designs into accurate HTML code by integrating visual and structural information, demonstrating superior performance over existing techniques.

Authors:Felix Burr, Marcel Hoffmann, Ansgar Scherp
Title: Active Few-Shot Learning for Vertex Classification Starting from an Unlabeled Dataset
Abstract:
Despite the ample availability of graph data, obtaining vertex labels is a tedious and expensive task. Therefore, it is desirable to learn from a few labeled vertices only. Existing few-shot learners assume a class oracle, which provides labeled vertices for a desired class. However, such an oracle is not available in a real-world setting, i.e., when drawing a vertex for labeling it is unknown to which class the vertex belongs. Few-shot learners are often combined with prototypical networks, while classical semi-supervised vertex classification uses discriminative models, e.g., Graph Convolutional Networks (GCN). In this paper, we train our models by iteratively prompting a human annotator with vertices to annotate. We perform three experiments where we continually relax our assumptions. First, we assume a class oracle, i.e., the human annotator is provided with an equal number of vertices to label for each class. We denote this as "Balanced Sampling". In the subsequent experiment, "Unbalanced Sampling," we replace the class oracle with k-medoids clustering and draw vertices to label from the clusters. In the last experiment, the "Unknown Number of Classes," we no longer assume that we know the number and distribution of classes. Our results show that prototypical models outperform discriminative models in all experiments when fewer than 20 samples per class are available. While dropping the assumption of the class oracle for the "Unbalanced Sampling" experiment reduces the performance of the GCN by 9%, the prototypical network loses only 1% on average. For the "Unknown Number of Classes" experiment, the average performance for both models decreased further by 1%. Source code: https://github.com/Ximsa/2023-felix-ma
中文摘要:本研究表明,在顶点分类的少样本学习中,原型网络始终优于如GCN等判别模型,尤其在标记数据稀缺时表现更佳,且在放宽类别分布假设时仍保持较强鲁棒性。
English Summary: This study demonstrates that prototypical networks outperform discriminative models like GCNs in few-shot vertex classification when fewer than 20 labeled samples per class are available, and that they remain robust even when class distribution assumptions are relaxed.
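
To make the prototypical approach concrete, here is a minimal sketch of nearest-prototype classification: each class prototype is the mean of the embeddings of its labeled vertices, and an unlabeled vertex takes the label of the closest prototype. The 2-D embeddings and labels are made up for illustration; the paper's encoder and training loop are not reproduced here.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def class_prototypes(labeled: Sequence[Tuple[Sequence[float], str]]) -> Dict[str, List[float]]:
    """Average the embeddings of the labeled vertices of each class."""
    groups = defaultdict(list)
    for vec, label in labeled:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def classify(vec: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(vec, prototypes[label]))


# Hypothetical vertex embeddings with two labeled classes.
labeled = [((0.0, 0.0), "a"), ((0.0, 2.0), "a"), ((4.0, 4.0), "b")]
prototypes = class_prototypes(labeled)
predicted = classify((1.0, 1.0), prototypes)
```

Because a prototype is just a mean, adding one newly annotated vertex only changes the prototype of its own class.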

Authors:Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Guofei Chen, Ji Zhang, Wenshan Wang
Title: SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models
Abstract:
Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D.
中文: SORT3D是一种创新方法,它结合2D物体属性、空间推理和大语言模型,无需文本到3D训练数据即可实现零样本三维物体定位,在未知环境中达到最先进性能。
English: SORT3D is a novel method that combines 2D object attributes with spatial reasoning and large language models to enable zero-shot 3D object grounding without requiring text-to-3D training data, achieving state-of-the-art performance in unseen environments.

Authors:Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri
Title: WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Abstract:
Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take action on their user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP -- a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of cases, even state-of-the-art agents often struggle to fully complete the attacker goals -- highlighting the current state of security by incompetence.
中文: WASP基准测试表明,即使在现实网络场景中,先进AI模型也易受简单提示注入攻击,部分攻击成功率高达86%,但因智能体安全能力不足,往往无法完全实现攻击者目标。
English: The WASP benchmark reveals that even advanced AI models are vulnerable to simple prompt injections in realistic web scenarios: attacks partially succeed in up to 86% of cases, yet agents often fail to fully complete the attacker's goals, a situation the authors call "security by incompetence".

Authors:Jialei Song, Xingquan Zuo, Feiyang Wang, Hai Huang, Tianle Zhang
Title: RDI: An adversarial robustness evaluation metric for deep neural networks based on model statistical features
Abstract:
Deep neural networks (DNNs) are highly susceptible to adversarial samples, raising concerns about their reliability in safety-critical tasks. Currently, methods of evaluating adversarial robustness are primarily categorized into attack-based and certified robustness evaluation approaches. The former not only relies on specific attack algorithms but is also highly time-consuming, while the latter, due to its analytical nature, is typically difficult to implement for large and complex models. A few studies evaluate model robustness based on the model's decision boundary, but they suffer from low evaluation accuracy. To address the aforementioned issues, we propose a novel adversarial robustness evaluation metric, Robustness Difference Index (RDI), which is based on model statistical features. RDI draws inspiration from clustering evaluation by analyzing the intra-class and inter-class distances of feature vectors separated by the decision boundary to quantify model robustness. It is attack-independent and has high computational efficiency. Experiments show that RDI demonstrates a stronger correlation with the gold-standard adversarial robustness metric of attack success rate (ASR). The average computation time of RDI is only 1/30 of the evaluation method based on the PGD attack. Our open-source code is available at: https://github.com/BUPTAIOC/RDI.
Chinese: 提出的鲁棒性差异指数(RDI)通过分析决策边界分隔的特征向量类内与类间距离,提供了一种独立于攻击且计算高效的对抗鲁棒性评估方法,与攻击成功率高度相关,计算时间仅为基于PGD攻击评估方法的1/30。
English: The proposed Robustness Difference Index (RDI) offers an attack-independent and computationally efficient method for evaluating adversarial robustness by analyzing intra-class and inter-class feature distances, showing strong correlation with attack success rates and requiring only 1/30 of the time compared to PGD-based evaluation.
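
The idea of scoring a model from intra-class and inter-class feature distances can be illustrated with a generic, clustering-style separation score. The sketch below is not the paper's RDI formula, which the abstract does not give; the function names, the normalization, and the toy feature vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def mean_distances(features: Sequence[Sequence[float]], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (mean intra-class distance, mean inter-class distance) over all pairs."""
    intra: List[float] = []
    inter: List[float] = []
    for (f1, y1), (f2, y2) in combinations(zip(features, labels), 2):
        (intra if y1 == y2 else inter).append(math.dist(f1, f2))
    return sum(intra) / len(intra), sum(inter) / len(inter)


def separation_index(features: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Hypothetical score in [-1, 1]; higher means classes are better separated."""
    intra, inter = mean_distances(features, labels)
    return (inter - intra) / (inter + intra)


# Toy feature vectors: two tight groups that are far apart.
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
separated = separation_index(features, [0, 0, 1, 1])
mixed = separation_index(features, [0, 1, 0, 1])
```

With labels that follow the two groups the score is positive, and with labels that cut across them it turns negative. A score of this kind needs only forward passes to obtain features, which is consistent with the attack-independent, low-cost property the abstract describes.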

Authors:Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun
Title: Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
Abstract:
Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework particularly designed to handle audio-visual joint corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, with corrupted input frames. Specifically, we suggest a unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities, by predicting clean audio targets with corrupted videos, and clean video targets with corrupted audios. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
中文摘要:该研究提出CAV2vec自监督框架,通过训练模型从受损输入中预测纯净目标,有效提升视听语音识别系统在现实环境中应对音频和视觉干扰的鲁棒性。
English Summary: The study introduces CAV2vec, a self-supervised framework that enhances audio-visual speech recognition by training models to predict clean targets from corrupted inputs, improving robustness against real-world audio and visual disruptions.

Authors:Gwen Yidou Weng, Benjie Wang, Guy Van den Broeck
Title: TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation
Abstract:
As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either post-train LMs for each new attribute--expensive and inflexible--or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable gEneration), a novel framework that efficiently computes EAP and adapts to new attributes through tractable probabilistic reasoning and lightweight control. TRACE distills a Hidden Markov Model (HMM) from an LM and pairs it with a small classifier to estimate attribute probabilities, enabling exact EAP computation over the HMM's predicted futures. This EAP is then used to reweigh the LM's next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art detoxification results with only 20% decoding overhead, yields 76 low-resource personalized LMs within seconds, and seamlessly extends to composite attributes. Our code is available at: https://github.com/yidouweng/trace.
中文: TRACE是一种新颖框架,通过可处理的概率推理高效计算预期属性概率并适应新属性,以最小的解码开销实现最先进的净化效果,并能快速个性化语言模型。
English: TRACE is a novel framework that efficiently computes expected attribute probabilities and adapts to new attributes through tractable probabilistic reasoning, achieving state-of-the-art detoxification with only 20% decoding overhead and enabling rapid personalization of language models.
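
The reweighting step described in the abstract can be sketched as a product of the LM's next-token probability and the expected attribute probability (EAP) of each candidate token, followed by renormalization. This Bayes-style form is an assumption; the abstract only says that EAP is used to reweigh next-token probabilities. The EAP values below are hand-picked rather than computed from a distilled HMM.

```python
from typing import Dict


def reweight_next_token(lm_probs: Dict[str, float], eap: Dict[str, float]) -> Dict[str, float]:
    """Multiply each LM next-token probability by its EAP and renormalize."""
    unnormalized = {tok: p * eap[tok] for tok, p in lm_probs.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all candidate tokens have zero attribute probability")
    return {tok: w / total for tok, w in unnormalized.items()}


# Hypothetical two-token vocabulary: the LM is indifferent, the attribute is not.
lm_probs = {"nice": 0.5, "awful": 0.5}
eap = {"nice": 0.9, "awful": 0.1}
controlled = reweight_next_token(lm_probs, eap)
```

Tokens whose likely continuations satisfy the attribute gain probability mass, while the LM's own preferences still shape the result.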

Authors:Jonas Frey, Turcan Tuna, Lanke Frank Tarimo Fu, Cedric Weibel, Katharine Patterson, Benjamin Krummenacher, Matthias Müller, Julian Nubert, Maurice Fallon, Cesar Cadena, Marco Hutter
Title: Boxi: Design Decisions in the Context of Algorithmic Performance for Robotics
Abstract:
Achieving robust autonomy in mobile robots operating in complex and unstructured environments requires a multimodal sensor suite capable of capturing diverse and complementary information. However, designing such a sensor suite involves multiple critical design decisions, such as sensor selection, component placement, thermal and power limitations, compute requirements, networking, synchronization, and calibration. While the importance of these key aspects is widely recognized, they are often overlooked in academia or retained as proprietary knowledge within large corporations. To improve this situation, we present Boxi, a tightly integrated sensor payload that enables robust autonomy of robots in the wild. This paper discusses the impact of payload design decisions made to optimize algorithmic performance for downstream tasks, specifically focusing on state estimation and mapping. Boxi is equipped with a variety of sensors: two LiDARs, 10 RGB cameras including high-dynamic range, global shutter, and rolling shutter models, an RGB-D camera, 7 inertial measurement units (IMUs) of varying precision, and a dual antenna RTK GNSS system. Our analysis shows that time synchronization, calibration, and sensor modality have a crucial impact on the state estimation performance. We frame this analysis in the context of cost considerations and environment-specific challenges. We also present a mobile sensor suite "cookbook" to serve as a comprehensive guideline, highlighting generalizable key design considerations and lessons learned during the development of Boxi. Finally, we demonstrate the versatility of Boxi being used in a variety of applications in real-world scenarios, contributing to robust autonomy. More details and code: https://github.com/leggedrobotics/grand_tour_box
Chinese: Boxi 是一款高度集成的传感器载荷,通过优化传感器选择、同步和校准,旨在提升机器人在复杂环境中的自主能力,并在多种实际场景中验证了其有效性。
English: Boxi is a highly integrated sensor payload designed to enhance robot autonomy in complex environments by optimizing sensor selection, synchronization, and calibration, with its effectiveness demonstrated across various real-world applications.

Authors:Alejandro Murillo-Gonzalez, Lantao Liu
Title: Action Flow Matching for Continual Robot Learning
Abstract:
Continual learning in robotics seeks systems that can constantly adapt to changing environments and tasks, mirroring human adaptability. A key challenge is refining dynamics models, essential for planning and control, while addressing issues such as safe adaptation, catastrophic forgetting, outlier management, data efficiency, and balancing exploration with exploitation -- all within task and onboard resource constraints. Towards this goal, we introduce a generative framework leveraging flow matching for online robot dynamics model alignment. Rather than executing actions based on a misaligned model, our approach refines planned actions to better match those the robot would take if its model were well aligned. We find that by transforming the actions themselves rather than exploring with a misaligned model -- as is traditionally done -- the robot collects informative data more efficiently, thereby accelerating learning. Moreover, we validate that the method can handle an evolving and possibly imperfect model while reducing, if desired, the dependency on replay buffers or legacy model snapshots. We validate our approach using two platforms: an unmanned ground vehicle and a quadrotor. The results highlight the method's adaptability and efficiency, with a record 34.2% higher task success rate, demonstrating its potential towards enabling continual robot learning. Code: https://github.com/AlejandroMllo/action_flow_matching.
中文: 本研究提出了一种利用流匹配的生成框架,在线优化机器人动作规划,提高了数据收集效率和适应性,减少了对回放缓冲区的依赖,在持续学习场景中实现了任务成功率提升34.2%。
English: This study introduces a generative framework using flow matching to refine robot action plans online, enhancing data collection efficiency and adaptability without heavy reliance on replay buffers, achieving a 34.2% higher task success rate in continual learning scenarios.

Authors:Ryo Yamaki, Shintaro Shiba, Guillermo Gallego, Yoshimitsu Aoki
Title: Iterative Event-based Motion Segmentation by Variational Contrast Maximization
Abstract:
Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e., motion segmentation), which is useful for various tasks such as object detection and visual servoing. We propose an iterative motion segmentation method, by classifying events into background (e.g., dominant motion hypothesis) and foreground (independent motion residuals), thus extending the Contrast Maximization framework. Experimental results demonstrate that the proposed method successfully classifies event clusters both for public and self-recorded datasets, producing sharp, motion-compensated edge-like images. The proposed method achieves state-of-the-art accuracy on moving object detection benchmarks with an improvement of over 30%, and shows potential for application to more complex and noisy real-world scenes. We hope this work broadens the sensitivity of Contrast Maximization with respect to both motion parameters and input events, thus contributing to theoretical advancements in event-based motion segmentation estimation. https://github.com/aoki-media-lab/event_based_segmentation_vcmax
中文摘要:该研究提出的迭代运动分割方法将事件相机数据分类为背景和前景运动,扩展了对比度最大化框架,在运动目标检测中实现了超过30%的性能提升,达到了当前最优精度水平。
English Summary: The proposed iterative motion segmentation method classifies event camera data into background and foreground motions, extending the Contrast Maximization framework to achieve state-of-the-art accuracy with over 30% improvement in moving object detection.

Authors:Erika Hunhoff, Joseph Melber, Kristof Denolf, Andra Bisca, Samuel Bayliss, Stephen Neuendorffer, Jeff Fifield, Jack Lo, Pranathi Vasireddy, Phil James-Roxby, Eric Keller
Title: Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface
Abstract:
Accelerators such as neural processing units (NPUs) deliver an enticing balance of performance and efficiency compared to general purpose compute architectures. However, effectively leveraging accelerator capabilities is not always simple: low-level programming toolkits may require substantial developer effort while high-level programming toolkits may abstract critical optimization features. This work aims to increase efficiency of designers using IRON, a toolkit for close-to-metal NPU performance engineers. We provide an updated programmer interface to IRON containing new and refined programming constructs. The new interface includes extensible features for placement and data transformation. These contributions are evaluated in terms of 1) efficiency, with analysis showing ~26% average reduction in lines of code and decreases in Halstead metrics for a variety of designs; 2) expressivity, demonstrating the new interface supports the wide range of features and patterns already supported by IRON; and 3) extensibility, illustrating the new tooling for placement and tiling can be extended to accommodate common use-cases.
中文摘要:IRON工具包通过更新编程接口,在保持表达力和扩展性的同时,显著降低了NPU性能工程师的代码复杂度,提升了开发效率。
English Summary: The updated programming interface for the IRON toolkit improves NPU programming efficiency, reducing lines of code by about 26% on average, while maintaining expressivity and extensibility for performance engineers.

Authors:KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
Title: Kimi-Audio Technical Report
Abstract:
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
Chinese: Kimi-Audio是一款开源的音频基础模型,凭借创新的架构和海量数据训练,在音频理解、生成与对话方面表现卓越,并在多项基准测试中达到领先水平。
English: Kimi-Audio is an open-source audio foundation model that excels in understanding, generating, and conversing with audio, achieving state-of-the-art performance across various benchmarks through innovative architecture and extensive data training.

Authors:Alan Khoja, Martin Kölbl, Stefan Leue, Rüdiger Wilhelmi
Title: Automated Consistency Analysis for Legal Contracts
Abstract:
Business contracts, particularly sale and purchase agreements, often contain a large number of clauses and are correspondingly long and complex. In practice, it is therefore a great challenge to keep track of their legal context and to identify and avoid inconsistencies in such contracts. Against this background, we describe a method and tool called ContractCheck which allows for the consistency analysis of legal contracts, in particular Share Purchase Agreements (SPAs). In order to identify the concepts that are relevant for an analysis we define an ontology for SPAs. The analysis is, then, based on an encoding of the preconditions for the execution of the clauses of an SPA, as well as on a set of proposed consistency constraints formalized using decidable fragments of First-Order Logic (FOL). Based on the ontology for SPAs, textual SPAs are first encoded in a structured natural language format that we refer to as "blocks". ContractCheck interprets these blocks and constraints and translates them into assertions formulated in FOL. It then invokes a Satisfiability Modulo Theory (SMT) solver in order to check the executability of a considered contract, either by providing a satisfying model, or by proving the existence of conflicting clauses that prevent the contract from being executed. We illustrate the application of ContractCheck to concrete SPAs, including one example of an SPA of realistic size and complexity, and conclude by suggesting directions for future research.
中文:ContractCheck是一种工具,通过运用本体论和一阶逻辑,对复杂的商业合同如股权购买协议进行一致性分析,利用SMT求解器检测冲突并确保合同的可执行性。
English: ContractCheck is a tool that analyzes the consistency of complex business contracts like Share Purchase Agreements by using an ontology and First-Order Logic to detect conflicts and ensure executability through an SMT solver.

Authors:Ning Xian, Yixing Fan, Ruqing Zhang, Maarten de Rijke, Jiafeng Guo
Title: An Empirical Study of Evaluating Long-form Question Answering
Abstract:
Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. Subsequently, we investigate the performance of automatic evaluation metrics by evaluating these answers and analyzing the consistency between these metrics and human evaluations. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue on some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.
中文: 本研究探讨了自动评估指标在长格式问答评价中的可靠性,揭示了答案风格、长度和问题类型导致的偏差,同时证明细粒度评估能在某些指标上缓解这一问题。
English: This study investigates the reliability of automatic metrics for evaluating long-form question answering, revealing biases related to answer style, length, and question type while demonstrating that fine-grained evaluation can partially mitigate these issues.

Authors:Xinmin Feng, Zhuoyuan Li, Li Li, Dong Liu, Feng Wu
Title: Partition Map-Based Fast Block Partitioning for VVC Inter Coding
Abstract:
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access configuration.
中文: 该方法通过基于神经网络的划分图预测和双阈值决策方案,在随机访问配置下实现了51.30%的编码时间节省,仅带来2.12%的BDBR性能损失。
English: The proposed neural network-based partition map prediction, combined with a dual-threshold decision scheme, saves an average of 51.30% of VVC encoding time under the random access configuration at the cost of a 2.12% BDBR increase.
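
As a rough illustration of the dual-threshold idea described in the abstract, the sketch below maps a predicted split probability to one of three encoder actions. The function name, the threshold values, and the fallback to a full rate-distortion (RD) search in the uncertain band are assumptions made for illustration, not the authors' implementation.

```python
def partition_decision(split_prob, low=0.2, high=0.8):
    """Map a predicted split probability to an encoder action.

    Hypothetical sketch: `low` and `high` are illustrative thresholds,
    not values taken from the paper.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if split_prob >= high:
        return "split"      # confident: split without searching
    if split_prob <= low:
        return "no_split"   # confident: terminate the search early
    return "rd_search"      # uncertain: fall back to the full RD check


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, partition_decision(p))
```

Under this sketch, widening the gap between `low` and `high` sends more blocks to the RD search, which would trade encoding-time savings for a smaller RD loss.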

Authors:Kesen Zhao, Beier Zhu, Qianru Sun, Hanwang Zhang
Title: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Abstract:
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available at https://github.com/kesenzhao/UV-CoT.
中文: UV-CoT框架通过无监督的视觉思维链推理,利用偏好优化比较模型生成的边界框而无需标注,提升了空间推理能力并在多个数据集上展现出优越的泛化性能。
English: The UV-CoT framework introduces an unsupervised approach for visual chain-of-thought reasoning, using preference optimization to compare model-generated bounding boxes without annotations, enhancing spatial reasoning and generalization across datasets.

Authors:Lei Shen, Xiaoyu Shen
Title: Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant
Abstract:
In recent years, multi-agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto-SLURP, a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants. Auto-SLURP extends the original SLURP dataset, initially developed for natural language understanding tasks, by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end-to-end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks, highlighting that truly reliable and intelligent multi-agent personal assistants remain a work in progress. The dataset and related code are available at https://github.com/lorashen/Auto-SLURP/.
Chinese: Auto-SLURP是一个通过重新标注SLURP数据集并集成模拟服务来评估基于大语言模型的多智能体个人助手的基准数据集,实验表明现有先进框架仍难以实现真正可靠的智能表现。
English: Auto-SLURP is a new benchmark dataset designed to evaluate LLM-based multi-agent frameworks for intelligent personal assistants by extending the SLURP dataset with relabeled data and simulated services, revealing current systems' limitations in achieving reliable performance.

Authors:Zhengru Fang, Zhenghao Liu, Jingjing Wang, Senkang Hu, Yu Guo, Yiqin Deng, Yuguang Fang
Title: Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy
Abstract:
To support the Low Altitude Economy (LAE), it is essential to achieve precise localization of unmanned aerial vehicles (UAVs) in urban areas where global positioning system (GPS) signals are unavailable. Vision-based methods offer a viable alternative but face severe bandwidth, memory and processing constraints on lightweight UAVs. Inspired by mammalian spatial cognition, we propose a task-oriented communication framework, where UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers. We introduce the Orthogonally-constrained Variational Information Bottleneck encoder (O-VIB), which incorporates automatic relevance determination (ARD) to prune non-informative features while enforcing orthogonality to minimize redundancy. This enables efficient and accurate localization with minimal transmission cost. Extensive evaluation on a dedicated LAE UAV dataset shows that O-VIB achieves high-precision localization under stringent bandwidth budgets. Code and dataset will be made publicly available at: github.com/fangzr/TOC-Edge-Aerial.
中文摘要:本研究提出了一种面向任务的通信框架,采用正交约束变分信息瓶颈编码器,通过向边缘服务器传输紧凑的多视角特征,实现在GPS信号缺失城市环境中无人机的高效精准定位。
English Summary: The study introduces a task-oriented communication framework using an Orthogonally-constrained Variational Information Bottleneck encoder to enable efficient and accurate UAV localization in GPS-denied urban environments by transmitting compact multi-view features to edge servers.

Authors:Marco Turzi, Siamak Mehrkanoon
Title: SSA-UNet: Advanced Precipitation Nowcasting via Channel Shuffling
Abstract:
Weather forecasting is essential for facilitating diverse socio-economic activity and environmental conservation initiatives. Deep learning techniques are increasingly being explored as complementary approaches to Numerical Weather Prediction (NWP) models, offering potential benefits such as reduced complexity and enhanced adaptability in specific applications. This work presents a novel design, Small Shuffled Attention UNet (SSA-UNet), which enhances SmaAt-UNet's architecture by including a shuffle channeling mechanism to optimize performance and diminish complexity. To assess its efficacy, this architecture and its reduced variant are examined and trained on two datasets: a Dutch precipitation dataset from 2016 to 2019, and a French cloud cover dataset containing radar images from 2017 to 2018. Three output configurations of the proposed architecture are evaluated, yielding outputs of 1, 6, and 12 precipitation maps, respectively. To better understand how this model operates and produces its predictions, a gradient-based approach called Grad-CAM is used to analyze the outputs generated. The analysis of heatmaps generated by Grad-CAM facilitated the identification of regions within the input maps that the model considers most informative for generating its predictions. The implementation of SSA-UNet can be found on GitHub at https://github.com/MarcoTurzi/SSA-UNet.
中文: 本研究提出SSA-UNet深度学习模型,通过引入通道混洗机制优化架构以提升天气预报性能,并基于荷兰和法国气象数据集采用Grad-CAM方法进行预测可解释性分析。
English: This study introduces SSA-UNet, a deep learning model that enhances weather forecasting by optimizing architecture with a shuffle channeling mechanism, evaluated on Dutch and French meteorological datasets using Grad-CAM for interpretability.
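
The abstract does not spell out the shuffle mechanism, so the sketch below shows the standard channel shuffle popularized by ShuffleNet as one plausible reading: channels are viewed as a (groups, channels-per-group) grid, transposed, and flattened so that later grouped layers see channels from every group. The function name and the list-based channel representation are assumptions for illustration only.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    `channels` is a list with one entry per channel; its length must be
    divisible by `groups`. Illustrative only, not the SSA-UNet code.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View as (groups, per_group), transpose, then flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```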

Authors:Tao Wu, Kexue Fu, Qiang Hua, Xinxin Liu, Muhammad Ali Imran, Bo Liu
Title: LEAM: A Prompt-only Large Language Model-enabled Antenna Modeling Method
Abstract:
Antenna modeling is a time-consuming and complex process, decreasing the speed of antenna analysis and design. In this paper, a large language model (LLM)-enabled antenna modeling method, called LEAM, is presented to address this challenge. LEAM enables automatic antenna model generation based on language descriptions via prompt input, images, descriptions from academic papers, patents, and technical reports (either one or multiple). The effectiveness of LEAM is demonstrated by three examples: a Vivaldi antenna generated from a complete user description, a slotted patch antenna generated from an incomplete user description and the operating frequency, and a monopole slotted antenna generated from images and descriptions scanned from the literature. For all the examples, correct antenna models are generated in a few minutes. The code can be accessed via https://github.com/TaoWu974/LEAM.
中文:LEAM是一种基于大语言模型的天线建模方法,能够通过文本描述、图像等多种输入自动生成精确的天线设计,将建模时间大幅缩短至几分钟。
English: LEAM is an innovative antenna modeling method that utilizes large language models to automatically generate accurate antenna designs from various inputs like text descriptions and images, significantly speeding up the process to just minutes.

Authors:Elena Plekhanova, Damien Robert, Johannes Dollinger, Emilia Arens, Philipp Brun, Jan Dirk Wegner, Niklaus Zimmermann
Title: SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology
Abstract:
With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce the corresponding dataset, SSL4Eco, a multi-date Sentinel-2 dataset on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state-of-the-art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
Chinese: 为解决现有地理空间模型的偏差并更好地捕捉全球植被季节性,我们提出了基于物候学采样的SSL4Eco数据集,通过季节对比学习方法在多种生态任务上实现了最优性能。
English: To address biases in existing geospatial models and better capture global vegetation seasonality, we introduce SSL4Eco, a phenology-informed dataset, and demonstrate its superior performance on ecological tasks through a season-contrastive learning approach.

Authors:Ritesh Goru, Shanay Mehta, Prateek Jain
Title: One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
Abstract:
Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from $O(N^{3})$ to $O(N^{2})$ and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
Chinese: 该方法通过复制响应令牌和自定义注意力掩码,实现了多轮对话的单次处理,在保持精度的同时将时间复杂度从O(N³)降低至O(N²),且损失与N次处理方法完全相同。
English: The proposed method duplicates response tokens with a custom attention mask to enable single-pass processing of multi-turn conversations, achieving identical losses to the N-pass approach while reducing time complexity from O(N³) to O(N²) and preserving accuracy.
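
One way to see where the stated complexities could come from is the back-of-the-envelope count below. It is not the paper's proof: it assumes each turn contributes at most $c$ tokens, that self-attention cost grows quadratically with sequence length, and that token duplication at most doubles the single-pass sequence.

```latex
% Assumptions (not from the source): at most c tokens per turn,
% attention cost quadratic in sequence length, duplication at most
% doubles the sequence processed in the single pass.
\begin{align*}
T_{N\text{-pass}} &\propto \sum_{k=1}^{N} (ck)^{2}
  = c^{2}\,\frac{N(N+1)(2N+1)}{6} = O(N^{3}), \\
T_{\text{one-pass}} &\propto (2cN)^{2} = 4c^{2}N^{2} = O(N^{2}).
\end{align*}
```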

Authors:Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
Title: E-InMeMo: Enhanced Prompting for Visual In-Context Learning
Abstract:
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
Chinese: 提出的E-InMeMo方法通过在上下文对中引入可学习的扰动来增强视觉上下文学习,在标准视觉任务中实现了显著性能提升,其中前景分割mIoU指标比基线方法提高7.99,单目标检测提升17.04。
English: The proposed E-InMeMo method enhances visual in-context learning by integrating learnable perturbations into in-context pairs, achieving state-of-the-art performance improvements of 7.99 mIoU in foreground segmentation and 17.04 mIoU in single object detection over baseline methods.

Authors:Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee
Title: DOSE : Drum One-Shot Extraction from Music Mixture
Abstract:
Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One-Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss, designed to encourage accurate prediction of the initial transient of drum one-shots, which is essential for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Frechet Audio Distance (FAD) and Multi-Scale Spectral loss (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures. The code, model checkpoint, and audio examples are available at https://github.com/HSUNEH/DOSE
中文: 本文提出Drum One-Shot Extractor (DOSE)模型,通过神经音频编解码语言模型和创新的起始点损失函数,直接从音乐混合音频中提取高质量鼓单音样本,其性能优于传统音源分离方法。
English: This paper introduces the Drum One-Shot Extractor (DOSE) model, which uses neural audio codec language models and a novel onset loss to directly extract high-quality drum one-shots from music mixtures, outperforming traditional source separation methods.

Authors:Jingjin Wang, Jiawei Han
Title: PropRAG: Guiding Retrieval with Beam Search over Proposition Paths
Abstract:
Retrieval Augmented Generation (RAG) has become the standard approach for equipping Large Language Models (LLMs) with up-to-date knowledge. However, standard RAG, relying on independent passage retrieval, often fails to capture the interconnected nature of information required for complex, multi-hop reasoning. While structured RAG methods attempt to address this using knowledge graphs built from triples, we argue that the inherent context loss of triples (context collapse) limits the fidelity of the knowledge representation. We introduce PropRAG, a novel RAG framework that shifts from triples to context-rich propositions and introduces an efficient, LLM-free online beam search over proposition paths to discover multi-step reasoning chains. By coupling a higher-fidelity knowledge representation with explicit path discovery, PropRAG achieves state-of-the-art zero-shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue, advancing non-parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.
Chinese: PropRAG 提出了一种利用情境化命题和基于命题路径的束搜索框架,无需在线调用大语言模型即可实现显式多步推理,通过更丰富的表征改进证据检索,在多个基准测试中取得了最先进的性能。
English: PropRAG introduces a framework using contextual propositions and beam search over proposition paths to enable explicit multi-step reasoning without online LLM inference, achieving state-of-the-art results on multiple benchmarks by enhancing evidence retrieval through richer representations.

Authors:Jingjin Wang
Title: PropRAG: Guiding Retrieval with Beam Search over Proposition Paths
Abstract:
Retrieval Augmented Generation (RAG) has become the standard non-parametric approach for equipping Large Language Models (LLMs) with up-to-date knowledge and mitigating catastrophic forgetting common in continual learning. However, standard RAG, relying on independent passage retrieval, fails to capture the interconnected nature of human memory crucial for complex reasoning (associativity) and contextual understanding (sense-making). While structured RAG methods like HippoRAG utilize knowledge graphs (KGs) built from triples, the inherent context loss limits fidelity. We introduce PropRAG, a framework leveraging contextually rich propositions and a novel beam search algorithm over proposition paths to explicitly discover multi-step reasoning chains. Crucially, PropRAG's online retrieval process operates entirely without invoking generative LLMs, relying instead on efficient graph traversal and pre-computed embeddings. This avoids online LLM inference costs and potential inconsistencies during evidence gathering. LLMs are used effectively offline for high-quality proposition extraction and post-retrieval for answer generation. PropRAG achieves state-of-the-art zero-shot Recall@5 results on PopQA (55.3%), 2Wiki (93.7%), HotpotQA (97.0%), and MuSiQue (77.3%), alongside top F1 scores (e.g., 52.4% on MuSiQue). By improving evidence retrieval through richer representation and explicit, LLM-free online path finding, PropRAG advances non-parametric continual learning.
Chinese: PropRAG 提出了一种利用情境化命题和基于命题路径的束搜索框架,无需在线调用大语言模型即可实现显式多步推理,通过更丰富的表征改进证据检索,在多个基准测试中取得了最先进的性能。
English: PropRAG introduces a framework using contextual propositions and beam search over proposition paths to enable explicit multi-step reasoning without online LLM inference, achieving state-of-the-art results on multiple benchmarks by enhancing evidence retrieval through richer representations.
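
To make the idea of an LLM-free beam search over proposition paths concrete, here is a minimal sketch that relies only on pre-computed embeddings and graph traversal, as the abstract describes. The graph layout (`neighbors`), the scoring rule (mean cosine similarity between the query and the propositions on a path), and all names are assumptions for illustration; the abstract does not specify PropRAG's actual scoring or data structures.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def beam_search_paths(query_vec, embeddings, neighbors, beam_width=2, max_hops=2):
    """Return the top `beam_width` proposition paths for a query.

    embeddings: dict mapping proposition id -> pre-computed vector
    neighbors:  dict mapping proposition id -> linked proposition ids
    Scoring by mean query similarity is an assumed, illustrative choice.
    """
    def score(path):
        return sum(cosine(query_vec, embeddings[p]) for p in path) / len(path)

    def top(paths):
        # Sort by descending score; break ties by path for determinism.
        return sorted(paths, key=lambda p: (-score(p), p))[:beam_width]

    beam = top((p,) for p in embeddings)  # seed with single propositions
    for _ in range(max_hops):
        candidates = set()
        for path in beam:
            extensions = [path + (n,)
                          for n in neighbors.get(path[-1], ())
                          if n not in path]
            # Paths that cannot be extended stay in the pool unchanged.
            candidates.update(extensions or [path])
        beam = top(candidates)
    return beam


if __name__ == "__main__":
    emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    links = {"a": ["b"], "b": ["c"]}
    print(beam_search_paths([1.0, 0.0], emb, links, beam_width=1, max_hops=1))
```

No generative model is called anywhere in the loop, so the cost per query in this sketch is bounded by `beam_width`, `max_hops`, and the node degree.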

Authors:Zhuohao Yan, Shaoquan Feng, Xingxing Li, Yuxuan Zhou, Chunxi Xia, Shengyu Li
Title: S3MOT: Monocular 3D Object Tracking with Selective State Space Model
Abstract:
Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
中文: 本研究提出了三项创新技术——用于高效数据关联的HSSM、提升目标重识别能力的FCOE和优化位姿估计的VeloSSM,显著推进了单目三维多目标跟踪技术,在KITTI基准测试中实现了最先进的性能。
English: This study introduces three innovative techniques—HSSM for efficient data association, FCOE for improved object re-identification, and VeloSSM for enhanced pose estimation—to advance monocular 3D multi-object tracking, achieving state-of-the-art performance on the KITTI benchmark.

Authors:Prachi Garg, Joseph K J, Vineeth N Balasubramanian, Necati Cihan Camgoz, Chengde Wan, Kenrick Kin, Weiguang Si, Shugao Ma, Fernando De La Torre
Title: POET: Prompt Offset Tuning for Continual Human Action Adaptation
Abstract:
As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, while this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) NTU RGB+D dataset for activity recognition, and (ii) SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms comprehensive benchmarks. Source code at https://github.com/humansensinglab/POET-continual-action-recognition.
Chinese: 本研究提出POET方法,通过隐私保护的少样本持续动作识别技术,让用户能在不存储敏感数据的情况下为XR设备高效添加新动作类别,并在两个新基准数据集上表现优于现有方法。
English: This research introduces POET, a privacy-aware few-shot continual action recognition method that enables users to efficiently add new action classes to XR devices without storing sensitive data, outperforming existing methods on two new benchmarks built on NTU RGB+D and SHREC-2017.

Authors:Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Yingshui Tan, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu
Title: DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., without oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.
中文: DREAM方法通过监督微调和强化学习系统性地解构多模态大语言模型中的风险,在保持正常任务性能的同时将安全有效性评分比GPT-4V提升了16.17%。
English: The DREAM method enhances safety alignment in Multimodal Large Language Models by systematically disentangling risks and training with supervised fine-tuning and iterative RLAIF, achieving a 16.17% improvement in the SIUO safe&effective score over GPT-4V without compromising performance on normal tasks.

Authors:Yiwei Zha
Title: SMARTFinRAG: Interactive Modularized Financial RAG Benchmark
Abstract:
Financial sectors are rapidly adopting language model technologies, yet evaluating specialized RAG systems in this domain remains challenging. This paper introduces SMARTFinRAG, addressing three critical gaps in financial RAG assessment: (1) a fully modular architecture where components can be dynamically interchanged during runtime; (2) a document-centric evaluation paradigm generating domain-specific QA pairs from newly ingested financial documents; and (3) an intuitive interface bridging research-implementation divides. Our evaluation quantifies both retrieval efficacy and response quality, revealing significant performance variations across configurations. The platform's open-source architecture supports transparent, reproducible research while addressing practical deployment challenges faced by financial institutions implementing RAG systems.
中文摘要:本文介绍了SMARTFinRAG平台,通过动态组件交换、以文档为中心的问答生成和直观界面解决金融RAG评估的关键缺口,同时量化性能差异并支持透明化研究。
English Summary: This paper introduces SMARTFinRAG, a modular platform addressing key gaps in financial RAG evaluation through dynamic component interchange, document-centric QA generation, and an intuitive interface, while quantifying performance variations and supporting transparent research.

Authors:Hanrui Wang, Shuo Wang, Chun-Shien Lu, Isao Echizen
Title: DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion
Abstract:
Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves attack success rates of 84.42% to 92.87% against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01% to 9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.
Chinese: 面部识别系统因模型反演攻击可从嵌入中恢复身份信息而存在隐私漏洞,但现有方法计算成本高且泛化能力不足,为此开发的DiffMI作为一种无需训练的扩散驱动攻击方法,实现了高成功率,并具备更强的适应性和效率。
English: Face recognition systems face privacy vulnerabilities as model inversion attacks can recover identity information from embeddings, but existing methods are computationally costly and lack generalization, prompting the development of DiffMI, a training-free diffusion-based attack that achieves high success rates with greater adaptability and efficiency.

Authors:Kazi Shahrukh Omar, Shuaijie Wang, Ridhuparan Kungumaraju, Tanvi Bhatt, Fabio Miranda
Title: VIGMA: An Open-Access Framework for Visual Gait and Motion Analytics
Abstract:
Gait disorders are commonly observed in older adults, who frequently experience various issues related to walking. Additionally, researchers and clinicians extensively investigate mobility related to gait in typically and atypically developing children, athletes, and individuals with orthopedic and neurological disorders. Effective gait analysis enables the understanding of the causal mechanisms of mobility and balance control of patients, the development of tailored treatment plans to improve mobility, the reduction of fall risk, and the tracking of rehabilitation progress. However, analyzing gait data is a complex task due to the multivariate nature of the data, the large volume of information to be interpreted, and the technical skills required. Existing tools for gait analysis are often limited to specific patient groups (e.g., cerebral palsy), only handle a specific subset of tasks in the entire workflow, and are not openly accessible. To address these shortcomings, we conducted a requirements assessment with gait practitioners (e.g., researchers, clinicians) via surveys and identified key components of the workflow, including (1) data processing and (2) data analysis and visualization. Based on the findings, we designed VIGMA, an open-access visual analytics framework integrated with computational notebooks and a Python library, to meet the identified requirements. Notably, the framework supports analytical capabilities for assessing disease progression and for comparing multiple patient groups. We validated the framework through usage scenarios with experts specializing in gait and mobility rehabilitation. VIGMA is available at https://github.com/komar41/VIGMA.
中文摘要:步态分析对于理解不同人群的行走问题至关重要,但现有工具存在局限;为此,开发了开源可视化分析框架VIGMA来弥补不足,并已通过专家验证。
English Summary: Gait analysis is crucial for understanding mobility issues across diverse populations, but existing tools are limited; thus, VIGMA, an open-access visual analytics framework, was developed to address these gaps and validated by experts.

Authors:Kaiyuan Tang, Siyuan Yao, Chaoli Wang
Title: iVR-GS: Inverse Volume Rendering for Explorable Visualization via Editable 3D Gaussian Splatting
Abstract:
In volume visualization, users can interactively explore the three-dimensional data by specifying color and opacity mappings in the transfer function (TF) or adjusting lighting parameters, facilitating meaningful interpretation of the underlying structure. However, rendering large-scale volumes demands powerful GPUs and high-speed memory access for real-time performance. While existing novel view synthesis (NVS) methods offer faster rendering speeds with lower hardware requirements, the visible parts of a reconstructed scene are fixed and constrained by preset TF settings, significantly limiting user exploration. This paper introduces inverse volume rendering via Gaussian splatting (iVR-GS), an innovative NVS method that reduces the rendering cost while enabling scene editing for interactive volume exploration. Specifically, we compose multiple iVR-GS models associated with basic TFs covering disjoint visible parts to make the entire volumetric scene visible. Each basic model contains a collection of 3D editable Gaussians, where each Gaussian is a 3D spatial point that supports real-time scene rendering and editing. We demonstrate the superior reconstruction quality and composability of iVR-GS against other NVS solutions (Plenoxels, CCNeRF, and base 3DGS) on various volume datasets. The code is available at https://github.com/TouKaienn/iVR-GS.
中文摘要:本文提出iVR-GS方法,通过高斯抛洒实现逆向体绘制,在降低渲染成本的同时支持场景实时编辑,为交互式体数据探索提供了高质量的重建效果。
English Summary: This paper introduces iVR-GS, an inverse volume rendering method using Gaussian splatting that enables real-time scene editing and reduces rendering costs while maintaining high reconstruction quality for interactive volume exploration.

Authors:Mert Sonmezer, Seyda Ertekin
Title: CANet: ChronoAdaptive Network for Enhanced Long-Term Time Series Forecasting under Non-Stationarity
Abstract:
Long-term time series forecasting plays a pivotal role in various real-world applications. Despite recent advancements and the success of different architectures, forecasting is often challenging due to the non-stationary nature of real-world data, which frequently exhibit distribution shifts and temporal changes in statistical properties like mean and variance over time. Previous studies suggest that this inherent variability complicates forecasting, limiting the performance of many models by leading to loss of non-stationarity and resulting in over-stationarization (Liu, Wu, Wang and Long, 2022). To address this challenge, we introduce a novel architecture, ChronoAdaptive Network (CANet), inspired by style-transfer techniques. The core of CANet is the Non-stationary Adaptive Normalization module, seamlessly integrating the Style Blending Gate and Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017). The Style Blending Gate preserves and reintegrates non-stationary characteristics, such as mean and standard deviation, by blending internal and external statistics, preventing over-stationarization while maintaining essential temporal dependencies. Coupled with AdaIN, which dynamically adapts the model to statistical changes, this approach enhances predictive accuracy under non-stationary conditions. CANet also employs multi-resolution patching to handle short-term fluctuations and long-term trends, along with Fourier analysis-based adaptive thresholding to reduce noise. A Stacked Kronecker Product Layer further optimizes the model's efficiency while maintaining high performance. Extensive experiments on real-world datasets validate CANet's superiority over state-of-the-art methods, achieving a 42% reduction in MSE and a 22% reduction in MAE. The source code is publicly available at https://github.com/mertsonmezer/CANet.
中文摘要:本文提出的ChronoAdaptive网络(CANet)通过非平稳自适应归一化模块解决了时间序列预测中的分布漂移问题,在真实数据集上相比现有方法实现了42%的均方误差降低和22%的平均绝对误差降低。
English Summary: The proposed ChronoAdaptive Network (CANet) addresses non-stationary time series forecasting challenges through its Non-stationary Adaptive Normalization module, achieving significant performance improvements with 42% lower MSE and 22% lower MAE compared to existing methods.
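
The AdaIN step named in this abstract follows the standard formula from Huang and Belongie (2017): normalize the input with its own mean and standard deviation, then rescale with target statistics. The sketch below applies that formula to a single 1-D window. It is an illustration only, not CANet's implementation: the function names are hypothetical, and `style_mean` and `style_std` stand in for whatever statistics the Style Blending Gate would supply.

```python
import math

def mean_std(values, eps=1e-8):
    """Return the mean and population standard deviation of a sequence."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(var + eps)  # eps avoids division by zero for flat windows

def adain(content, style_mean, style_std):
    """Standardize `content`, then shift and scale it to the target statistics."""
    mu, sigma = mean_std(content)
    return [style_std * (v - mu) / sigma + style_mean for v in content]

# Illustrative window; the target statistics here are arbitrary assumptions.
window = [10.0, 12.0, 14.0, 16.0]
restyled = adain(window, style_mean=0.0, style_std=1.0)
```

After the call, `restyled` has approximately the requested mean and standard deviation, which is the property AdaIN relies on to re-inject statistics that plain normalization would discard.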

Authors:Haokai Zhang, Shengtao Zhang, Zijian Cai, Heng Wang, Ruixuan Zhu, Zinan Zeng, Minnan Luo
Title: Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection
Abstract:
Spoilers in movie reviews are important on platforms like IMDb and Rotten Tomatoes, offering benefits and drawbacks. They can guide some viewers' choices but also affect those who prefer no plot details in advance, making effective spoiler detection essential. Existing spoiler detection methods mainly analyze review text, often overlooking the impact of movie genres and user bias, limiting their effectiveness. To address this, we analyze movie review data, finding genre-specific variations in spoiler rates and identifying that certain users are more likely to post spoilers. Based on these findings, we introduce a new spoiler detection framework called GUSD (Genre-aware and User-specific Spoiler Detection), which incorporates genre-specific data and user behavior bias. User bias is calculated through dynamic graph modeling of review history. Additionally, the R2GFormer module combines RetGAT (Retentive Graph Attention Network) for graph information and GenreFormer for genre-specific aggregation. The GMoE (Genre-Aware Mixture of Experts) model further assigns reviews to specialized experts based on genre. Extensive testing on benchmark datasets shows that GUSD achieves state-of-the-art results. This approach advances spoiler detection by addressing genre and user-specific patterns, enhancing user experience on movie review platforms. The code is available at https://github.com/AI-explorer-123/GUSD.
中文摘要:GUSD框架通过整合电影类型特征和用户行为偏差,利用动态图建模和类型感知模块,显著提升了影评中剧透检测的效果,实现了最先进的性能。
English Summary: The GUSD framework improves spoiler detection in movie reviews by incorporating genre-specific patterns and user behavior biases, achieving state-of-the-art results through dynamic graph modeling and specialized genre-aware modules.

Authors:Anirudhan Badrinath, Alex Yang, Kousik Rajesh, Prabhat Agarwal, Jaewon Yang, Haoyu Chen, Jiajing Xu, Charles Rosenberg
Title: OmniSage: Large Scale, Multi-Entity Heterogeneous Graph Representation Learning
Abstract:
Representation learning, a task of learning latent vectors to represent entities, is a key task in improving search and recommender systems in web applications. Various representation learning methods have been developed, including graph-based approaches for relationships among entities, sequence-based methods for capturing the temporal evolution of user activities, and content-based models for leveraging text and visual content. However, the development of a unifying framework that integrates these diverse techniques to support multiple applications remains a significant challenge. This paper presents OmniSage, a large-scale representation framework that learns universal representations for a variety of applications at Pinterest. OmniSage integrates graph neural networks with content-based models and user sequence models by employing multiple contrastive learning tasks to effectively process graph data, user sequence data, and content signals. To support the training and inference of OmniSage, we developed an efficient infrastructure capable of supporting Pinterest graphs with billions of nodes. The universal representations generated by OmniSage have significantly enhanced user experiences on Pinterest, leading to an approximate 2.5% increase in sitewide repins (saves) across five applications. This paper highlights the impact of unifying representation learning methods, and we make the model code publicly available at https://github.com/pinterest/atg-research/tree/main/omnisage.
中文: OmniSage是一个统一的表示学习框架,通过对比学习整合图神经网络、基于内容的模型和用户序列模型,在Pinterest应用中显著提升用户参与度,使全站转存量增加约2.5%。
English: OmniSage is a unified representation learning framework that integrates graph neural networks, content-based models, and user sequence models through contrastive learning, significantly improving user engagement on Pinterest with a 2.5% increase in repins across applications.

Authors:Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
Title: Step1X-Edit: A Practical Framework for General Image Editing
Abstract:
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
中文: 近年来,多模态模型如GPT-4o和Gemini2 Flash在图像编辑领域取得显著进展,但开源模型仍存在差距;为此,我们推出Step1X-Edit,采用多模态大语言模型和扩散解码器,在GEdit-Bench评估中性能接近领先专有模型,推动了该领域的发展。
English: Recent advancements in multimodal models like GPT-4o and Gemini2 Flash have enhanced image editing, but a performance gap remains with open-source alternatives, prompting the introduction of Step1X-Edit, which uses a Multimodal LLM and a diffusion decoder and, on the newly introduced GEdit-Bench benchmark, outperforms open-source baselines while approaching the performance of leading proprietary models.

Authors:Mingchen Jiang, Peng Xu, Xichen Ye, Xiaohui Chen, Yun Yang, Yifan Chen
Title: Embedding Empirical Distributions for Computing Optimal Transport Maps
Abstract:
Distributional data have become increasingly prominent in modern signal processing, highlighting the necessity of computing optimal transport (OT) maps across multiple probability distributions. Nevertheless, recent studies on neural OT methods predominantly focused on the efficient computation of a single map between two distributions. To address this challenge, we introduce a novel approach to learning transport maps for new empirical distributions. Specifically, we employ the transformer architecture to produce embeddings from distributional data of varying length; these embeddings are then fed into a hypernetwork to generate neural OT maps. Various numerical experiments were conducted to validate the embeddings and the generated OT maps. The model implementation and the code are provided on https://github.com/jiangmingchen/HOTET.
中文摘要:本文提出了一种基于Transformer的超网络方法,用于高效学习多个经验分布的最优传输映射,解决了现有方法主要计算两个分布间单一传输映射的局限性。
English summary: This paper introduces a transformer-based hypernetwork approach to efficiently learn optimal transport maps for multiple empirical distributions, addressing the limitation of existing methods that primarily compute single transport maps between two distributions.

Authors:Matthijs van der Lende, Jeremias Lino Ferrao, Niclas Müller-Hof
Title: Evaluating Uncertainty in Deep Gaussian Processes
Abstract:
Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at https://github.com/matthjs/xai-gp.
Chinese: 深度高斯过程和深度西格玛点过程在分布内校准表现优异,但在分布偏移下存在脆弱性,而深度集成方法在性能和校准方面均展现出更强的鲁棒性。
English: Deep Gaussian Processes and Deep Sigma Point Processes provide strong in-distribution calibration but show vulnerabilities under distribution shifts, whereas Deep Ensembles demonstrate superior robustness in both performance and calibration.
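
Expected Calibration Error, one of the calibration metrics used in this evaluation, can be sketched as follows. This is the common equal-width binning formulation: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged with bin-size weights. The bin count and the half-open binning convention are assumptions for illustration and may differ from the paper's code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp it into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Four predictions made with 90% confidence, of which three are right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the gap between the two is what ECE reports.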

Authors:Óscar Escudero-Arnanz, Antonio G. Marques, Inmaculada Mora-Jiménez, Joaquín Álvarez-Rodríguez, Cristina Soguero-Ruiz
Title: Early Detection of Multidrug Resistance Using Multivariate Time Series Analysis and Interpretable Patient-Similarity Representations
Abstract:
Background and Objectives: Multidrug Resistance (MDR) is a critical global health issue, causing increased hospital stays, healthcare costs, and mortality. This study proposes an interpretable Machine Learning (ML) framework for MDR prediction, aiming for both accurate inference and enhanced explainability. Methods: Patients are modeled as Multivariate Time Series (MTS), capturing clinical progression and patient-to-patient interactions. Similarity among patients is quantified using MTS-based methods: descriptive statistics, Dynamic Time Warping, and Time Cluster Kernel. These similarity measures serve as inputs for MDR classification via Logistic Regression, Random Forest, and Support Vector Machines, with dimensionality reduction and kernel transformations improving model performance. For explainability, patient similarity networks are constructed from these metrics. Spectral clustering and t-SNE are applied to identify MDR-related subgroups and visualize high-risk clusters, enabling insight into clinically relevant patterns. Results: The framework was validated on ICU Electronic Health Records from the University Hospital of Fuenlabrada, achieving an AUC of 81%. It outperforms baseline ML and deep learning models by leveraging graph-based patient similarity. The approach identifies key risk factors -- prolonged antibiotic use, invasive procedures, co-infections, and extended ICU stays -- and reveals clinically meaningful clusters. Code and results are available at https://github.com/oscarescuderoarnanz/DM4MTS. Conclusions: Patient similarity representations combined with graph-based analysis provide accurate MDR prediction and interpretable insights. This method supports early detection, risk factor identification, and patient stratification, highlighting the potential of explainable ML in critical care.
中文: 本研究提出了一种可解释的机器学习框架,通过患者相似性网络和多变量时间序列分析,实现了对多重耐药性的精准预测(AUC达81%),并识别出关键临床风险因素,为重症监护提供了可解释的临床洞见。
English: This study introduces an interpretable machine learning framework that uses patient similarity networks and multivariate time series analysis to accurately predict multidrug resistance, achieving an 81% AUC and identifying key clinical risk factors for enhanced explainability in critical care.
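
Dynamic Time Warping, one of the patient-similarity measures listed in this abstract, can be illustrated with the classic dynamic-programming recurrence. The sketch below handles univariate sequences with an absolute-difference local cost, which is a simplification: the study works with multivariate series, and its exact cost function and any windowing constraints are not stated here.

```python
def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences using absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Same shape at different speeds aligns with zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
offset = dtw_distance([0, 0], [1, 1])
```

Because the alignment may repeat samples, two series that trace the same trajectory at different paces are treated as close, which is why DTW suits clinical records sampled at irregular rates.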

Authors:Honghao Li, Hanwei Li, Jing Zhang, Yi Zhang, Ziniu Yu, Lei Sang, Yiwen Zhang
Title: Quadratic Interest Network for Multimodal Click-Through Rate Prediction
Abstract:
Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore what multimodal recommendation model can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at https://github.com/salmon1802/QIN.
中文: WWW 2025研讨会提出用于多模态点击率预测的二次兴趣网络模型,通过自适应注意力与二次网络有效捕捉多模态高阶特征交互,以0.9798的AUC值在竞赛中荣获第二名。
English: For the Multimodal CTR Prediction Challenge of the WWW 2025 EReL@MIR Workshop, this paper proposes the Quadratic Interest Network (QIN), which uses adaptive sparse target attention and Quadratic Neural Networks to capture high-order feature interactions across modalities, ranking second in the competition with an AUC of 0.9798.

Authors:Shengtao Zhang, Haokai Zhang, Shiqi Lou, Zicheng Wang, Zinan Zeng, Yilin Wang, Minnan Luo
Title: PTCL: Pseudo-Label Temporal Curriculum Learning for Label-Limited Dynamic Graph
Abstract:
Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL (Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generates pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graphs. Experiments across real-world scenarios demonstrate PTCL's consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at https://github.com/3205914485/FLiD.
中文: PTCL是一种仅使用最终时间戳标签进行动态节点分类的创新方法,通过伪标签生成和时间课程学习策略有效解决了标签稀缺问题。
English: PTCL is a novel method for dynamic node classification using only final timestamp labels, employing pseudo-label generation and temporal curriculum learning to overcome limited label availability.
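
The Temporal Curriculum Learning idea in this abstract, weighting pseudo-labels more heavily the closer they are to the final timestamp via an exponentially decaying function, can be sketched as below. The specific form exp(-decay * (final_time - t)), the `decay` value, and both function names are assumptions for illustration; the abstract does not give the exact schedule used in PTCL.

```python
import math

def temporal_weights(timestamps, final_time, decay=0.1):
    """Weight for each pseudo-label: 1.0 at the final timestamp, decaying with distance."""
    return [math.exp(-decay * (final_time - t)) for t in timestamps]

def weighted_loss(per_sample_losses, weights):
    """Weighted mean of per-sample losses, so distant pseudo-labels count less."""
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / sum(weights)

# Hypothetical pseudo-labels at timestamps 0, 5, and 10 with final timestamp 10.
weights = temporal_weights([0, 5, 10], final_time=10)
loss = weighted_loss([1.0, 1.0, 0.0], weights)
```

A larger `decay` concentrates training on pseudo-labels near the final timestamp, where they are most likely to agree with the available ground-truth labels.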

Authors:Fengchun Liu, Tong Zhang, Chunying Zhang
Title: STCL:Curriculum learning Strategies for deep learning image steganography models
Abstract:
Aiming at the problems of poor steganographic image quality and slow network convergence in deep learning-based image steganography models, this paper proposes a Steganography Curriculum Learning training strategy (STCL). The strategy selects only easy images for training in the initial stage, when the model's fitting ability is poor, and gradually expands to more difficult images. It comprises a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. First, multiple teacher models are trained, and the consistency of the quality of steganographic images under multiple teacher models is used as the difficulty score to construct training subsets from easy to difficult. Second, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed scheme improves model performance under multiple algorithmic frameworks: it achieves high PSNR, SSIM, and decoding accuracy, and the steganographic images generated by models trained with STCL receive low steganalysis scores. The code is available at https://github.com/chaos-boops/STCL.
中文: 本文提出了一种隐写课程学习策略,通过从易到难的图像逐步训练模型,有效提升了隐写图像质量、加速了网络收敛,并在多个数据集上验证了其优越性能。
English: This paper introduces a Steganography Curriculum Learning (STCL) strategy that enhances image steganography models by training on progressively difficult images, improving quality and efficiency while reducing detectability across multiple datasets.

Authors:Ivan Rossi, Flavio Sartori, Cesare Rollo, Giovanni Birolo, Piero Fariselli, Tiziana Sanavia
Title: Beyond Cox Models: Assessing the Performance of Machine-Learning Methods in Non-Proportional Hazards and Non-Linear Survival Analysis
Abstract:
Survival analysis often relies on Cox models, assuming both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight different models were tested, including six non-linear models of which four were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated due to the improper use of Harrell's concordance index (C-index) instead of more appropriate scores such as Antolini's concordance index, which generalizes C-index in cases where the PH assumption does not hold. In addition, since high C-index models are occasionally badly calibrated, combining Antolini's C-index with Brier's score is useful to assess the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods to select the most appropriate one according to sample size, non-linearity and non-PH conditions. To allow easy reproducibility of these tests on our benchmark data, code and documentation are freely available at https://github.com/compbiomed-unito/survhive.
Chinese: 本研究比较了机器学习、深度学习模型与Cox回归在生存分析中的表现,结果表明在特定条件下,使用Antolini一致性指数和Brier评分等恰当指标评估时,非线性和非比例风险模型能够优于传统方法。
English: This study compares machine and deep learning models with Cox regression for survival analysis, demonstrating that non-linear and non-proportional hazards models can outperform traditional methods under specific conditions when evaluated with appropriate metrics like Antolini's C-index and Brier's score.
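
For context on the metrics discussed in this abstract, the sketch below computes Harrell's C-index, the baseline score the authors argue is often misapplied. It ranks subjects by a single time-independent risk score, which is appropriate only under proportional hazards. Antolini's variant instead compares predicted survival probabilities evaluated at the earlier subject's event time. The function name and the toy data are illustrative, and this is not the survhive implementation.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half concordant
    # Assumes at least one comparable pair exists; otherwise this divides by zero.
    return concordant / comparable

# Toy cohort: the third subject is censored at time 3.
perfect = harrell_c_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
reversed_order = harrell_c_index([1, 2, 3], [1, 1, 0], [1.0, 2.0, 3.0])
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ranking, and 0.0 means the ranking is fully inverted.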

Authors:Lin Che, Yizi Chen, Tanhua Jin, Martin Raubal, Konrad Schindler, Peter Kiefer
Title: Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior
Abstract:
Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at https://github.com/lin102/CCGP.
Chinese: 本研究提出一种结合地理先验的无监督对比聚类模型,利用街景图像生成土地利用地图,为城市规划提供无需标注数据的可扩展定制化解决方案。
English: This study introduces an unsupervised contrastive clustering model with geographical priors to generate land use maps from street view images, offering a scalable and adaptable solution for urban planning without requiring labeled data.

Authors:Anyi Xiao, Cihui Yang
Title: TableCenterNet: A one-stage network for table structure recognition
Abstract:
Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at https://github.com/dreamy-xay/TableCenterNet.
中文摘要:本文提出TableCenterNet,一种单阶段端到端网络,首次将表格空间与逻辑结构解析统一为并行回归任务,通过共享特征提取和任务解码的协同架构,在跨场景适应性和计算效率上优于现有方法。
English Summary: This paper introduces TableCenterNet, a one-stage end-to-end network that unifies spatial and logical table structure parsing into parallel regression, achieving superior adaptability, robustness, and efficiency compared to existing methods.

Authors:Zihan Cheng, Jintao Guo, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao
Title: Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation
Abstract:
To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation tasks. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by this success, in this paper, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds the previous SOTA of 88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
Chinese: 本文提出Mamba-Sea框架,通过全局到局部的序列增强策略提升医学图像分割的领域泛化能力,在Prostate数据集上首次突破90%的Dice系数,超越了现有最佳性能。
English: This paper introduces Mamba-Sea, a novel Mamba-based framework that enhances domain generalization in medical image segmentation through global-to-local sequence augmentation, achieving state-of-the-art performance with over 90% Dice coefficient on the Prostate dataset.
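
The Dice coefficient reported in this entry measures the overlap between a predicted and a reference segmentation mask: twice the intersection divided by the sum of the two mask sizes. The sketch below computes it for flattened binary masks. The function name is illustrative, and returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly; treat that case as a score of 1.0 (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total

# Prediction marks two pixels, reference marks one, and they share one pixel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the masks share one foreground pixel out of three marked in total, so the score is 2/3; a 90% Dice therefore indicates that predicted and reference regions overlap almost entirely.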

Authors:Mingqi Yuan, Qi Wang, Guozheng Ma, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng Tao
Title: Plasticine: Accelerating Research in Plasticity-Motivated Deep Reinforcement Learning
Abstract:
Developing lifelong learning agents is crucial for artificial general intelligence. However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 10 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to open-ended environments. This framework enables researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across different contexts. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/Plasticine.
中文: Plasticine是首个用于深度强化学习中可塑性优化的开源基准框架,它提供了多种缓解方法、评估指标和学习场景,以系统性应对可塑性损失问题。
English: Plasticine is introduced as the first open-source framework to benchmark plasticity optimization in deep RL, offering implementations of mitigation methods, evaluation metrics, and learning scenarios to systematically address plasticity loss.

Authors:Vipin Singh, Tianheng Ling, Teodor Chiaburu, Felix Biessmann
Title: Evaluating Time Series Models for Urban Wastewater Management: Predictive Performance, Model Complexity and Resilience
Abstract:
Climate change increases the frequency of extreme rainfall, placing a significant strain on urban infrastructures, especially Combined Sewer Systems (CSS). Overflows from overburdened CSS release untreated wastewater into surface waters, posing environmental and public health risks. Although traditional physics-based models are effective, they are costly to maintain and difficult to adapt to evolving system dynamics. Machine Learning (ML) approaches offer cost-efficient alternatives with greater adaptability. To systematically assess the potential of ML for modeling urban infrastructure systems, we propose a protocol for evaluating Neural Network architectures for CSS time series forecasting with respect to predictive performance, model complexity, and robustness to perturbations. In addition, we assess model performance on peak events and critical fluctuations, as these are the key regimes for urban wastewater management. To investigate the feasibility of lightweight models suitable for IoT deployment, we compare global models, which have access to all information, with local models, which rely solely on nearby sensor readings. Additionally, to explore the security risks posed by network outages or adversarial attacks on urban infrastructure, we introduce error models that assess the resilience of models. Our results demonstrate that while global models achieve higher predictive performance, local models provide sufficient resilience in decentralized scenarios, ensuring robust modeling of urban infrastructure. Furthermore, models with longer native forecast horizons exhibit greater robustness to data perturbations. These findings contribute to the development of interpretable and reliable ML solutions for sustainable urban wastewater management. The implementation is available in our GitHub repository.
中文: 气候变化加剧了极端降雨,给城市合流制污水系统带来压力,本研究提出评估机器学习模型的协议,发现全局模型虽预测更优,但局部模型在分散式场景中更具韧性,有助于可持续污水管理。
English: Climate change exacerbates extreme rainfall, stressing urban sewer systems, and this study proposes a protocol to evaluate machine learning models for forecasting, finding that while global models perform better, local models offer resilience for decentralized, sustainable wastewater management.

Authors:De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz
Title: FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
Abstract:
There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
中文: 提出的FRAG框架通过独立选择相关帧而无需长上下文模型,有效提升了现有大模型在长视频和长文档任务上的性能表现。
English: The proposed FRAG framework enhances long input processing by selecting relevant frames independently without long context models, significantly improving performance on both video and document tasks using existing LMMs.
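
The select-then-generate idea in the abstract can be sketched in a few lines. The callables `score_fn` and `generate_fn` are hypothetical stand-ins for the underlying LMM calls, and returning the selected frames in their original order is an assumption of this sketch rather than something the abstract states.

```python
import heapq


def select_top_k_frames(scores, k):
    """Return the indices of the k highest-scoring frames, in original order."""
    k = min(k, len(scores))
    top = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)


def frag_answer(frames, query, score_fn, generate_fn, k=8):
    """Score each frame on its own, keep the Top-K, then generate from those."""
    # Each frame is scored independently, so no long-context pass is needed.
    scores = [score_fn(frame, query) for frame in frames]
    keep = select_top_k_frames(scores, k)
    return generate_fn([frames[i] for i in keep], query)
```

Because scoring is per frame, the cost of selection grows linearly with the number of frames or pages, and the generator only ever sees `k` of them.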

Authors:Francesc Marti-Escofet, Benedikt Blumenstiel, Linus Scheibenreif, Paolo Fraccaro, Konrad Schindler
Title: Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models
Abstract:
Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.
中文: 参数高效微调(PEFT)技术通过匹配甚至超越全参数微调的性能,在降低计算成本的同时提升模型对未知地理区域的泛化能力,为地球观测任务提供了有效的预训练模型适配方案。
English: Parameter-Efficient Fine-Tuning (PEFT) techniques effectively adapt large foundation models for Earth observation tasks by matching or surpassing full fine-tuning performance while reducing computational costs and enhancing generalization to new regions.

Authors:Jihyun Lee, Yejin Jeon, Seungyeon Seo, Gary Geunbae Lee
Title: PicPersona-TOD: A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Abstract:
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains. Project repository: https://github.com/JihyunLee1/PicPersona.
中文: PicPersona-TOD是一种创新数据集,通过整合用户图像实现个性化对话回复,借助定制化互动提升用户参与度并减少通用性回答。
English: PicPersona-TOD is a novel dataset that integrates user images to enable personalized dialogue responses, improving user engagement through tailored interactions and reducing generic outputs.

Authors:Hassan Keshvarikhojasteh, Mihail Tifrea, Sibylle Hess, Josien P. W. Pluim, Mitko Veta
Title: A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology
Abstract:
Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks. Our code is available at https://github.com/tueimage/GABMIL.
中文: 本研究提出的GABMIL模型通过显式建模病理图像中斑块间的相互作用,在保持计算效率的同时显著提升了肿瘤分型的分类性能。
English: This study introduces GABMIL, an enhanced version of ABMIL that explicitly models inter-patch dependencies in whole slide images, achieving significant performance improvements in tumor subtyping with minimal computational overhead.
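
For context, the baseline that GABMIL extends can be sketched as plain attention-based MIL pooling. This is the standard ABMIL formulation, not the GABMIL code; the parameter names `V` and `w` and the toy dimensions are illustrative assumptions. Note that each attention logit is computed from a single patch embedding, which is why this baseline carries no information about interactions between patches.

```python
import math


def abmil_pool(instances, V, w):
    """Attention-based MIL pooling over one bag.

    instances: list of K patch embeddings, each a list of D floats.
    V: hidden-by-D weight matrix (list of rows); w: vector of length hidden.
    Returns (bag_embedding, attention_weights).
    """
    logits = []
    for h in instances:
        # The logit of a patch depends only on that patch: w^T tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        logits.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Numerically stable softmax over the patches in the bag.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Bag embedding is the attention-weighted sum of patch embeddings.
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn
```

An interaction-aware variant would have to mix information across patches before this pooling step; how GABMIL does so is not specified in the abstract and is not reproduced here.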

Authors:Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian
Title: LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
Abstract:
Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at https://github.com/Yarayx/livelongbench.
中文: 本研究构建了首个基于直播的口语长文本数据集,以弥补现有基准在反映真实对话冗余性和复杂性方面的不足,发现当前方法在处理高冗余输入时表现不佳,并提出一种新基线在各项任务中均取得强劲性能。
English: This study introduces the first spoken long-text dataset from live streams to address the limitations of current benchmarks in capturing the redundancy and conversational complexity of real-world dialogues, revealing that existing methods struggle with highly redundant inputs and proposing a new baseline that improves performance across tasks.

Authors:Xiuying Chen, Tairan Wang, Juexiao Zhou, Zirui Song, Xin Gao, Xiangliang Zhang
Title: Evaluating and Mitigating Bias in AI-Based Medical Text Generation
Abstract:
Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes underperforming groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in the medical domain. Our code is publicly available to facilitate further research at https://github.com/iriscxy/GenFair.
中文: 医疗文本生成中的人工智能系统在不同人口群体间存在性能差异,而提出的选择性优化算法在保持整体准确率波动不超过2%的同时,将各类指标下的群体差异降低了30%以上。
English: AI systems in medical text generation exhibit performance disparities across demographic groups, but a proposed selective optimization algorithm effectively reduces bias by over 30% while maintaining overall accuracy within 2% variation.

Authors:Yinqi Li, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Title: DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
Abstract:
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach that replaces the heavy prediction enumeration process, and a prior distribution model that makes more accurate use of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE.
中文: 本文通过逆向应用预训练的布局到图像扩散模型,将生成式扩散模型的判别能力从分类扩展到更复杂的物体检测任务,提出的优化方法在COCO数据集上达到基础判别模型水平,并大幅加速了基于扩散的分类方法且保持精度。
English: This paper extends pretrained diffusion models from generative to discriminative tasks, specifically object detection, by inverting a layout-to-image model and introducing optimization techniques that achieve competitive performance on COCO while accelerating classification without accuracy loss.
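
The role of Bayes' rule in turning a conditional generative model into a classifier can be shown with a small sketch. It assumes that a per-class log-likelihood estimate is already available for the input; how such estimates are obtained from the diffusion model is outside this sketch, and the function name is illustrative. The sketch enumerates every candidate label, which is the costly step that the abstract says the gradient-based optimization replaces.

```python
import math


def class_posterior(log_likelihoods, log_priors):
    """Posterior p(y | x), proportional to p(x | y) * p(y), in log space.

    log_likelihoods[i]: estimate of log p(x | y_i), assumed given.
    log_priors[i]: log p(y_i), e.g. from a learned prior model.
    """
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    m = max(joint)                         # subtract the max for stability
    exps = [math.exp(j - m) for j in joint]
    z = sum(exps)
    return [e / z for e in exps]
```

With a uniform prior the predicted class is simply the one with the highest likelihood; a non-uniform prior shifts the decision, which is where a more accurate prior model can help.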

Authors:Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
中文: PaperCoder是一个多智能体大语言模型框架,通过规划、分析和生成三个阶段将机器学习论文自动转化为功能性代码库,在人工评估和基准测试中均展现出卓越性能。
English: PaperCoder is a multi-agent LLM framework that automatically converts machine learning papers into functional code repositories through planning, analysis, and generation phases, demonstrating superior performance in both human and benchmark evaluations.
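
One small piece of the pipeline, generating files in an order that respects the dependencies identified during planning, can be sketched with the standard library. The file names and the dependency map below are hypothetical, and this is only an illustration of dependency-aware ordering, not PaperCoder's actual scheduling logic.

```python
from graphlib import TopologicalSorter


def generation_order(file_deps):
    """Return files ordered so that each one comes after its dependencies.

    file_deps maps a file name to the set of files it depends on.
    """
    return list(TopologicalSorter(file_deps).static_order())


# Hypothetical dependency map such as a planning stage might produce.
deps = {
    "config.yaml": set(),
    "dataset.py": {"config.yaml"},
    "model.py": {"config.yaml"},
    "train.py": {"dataset.py", "model.py"},
}
order = generation_order(deps)
```

`TopologicalSorter` raises `graphlib.CycleError` when the dependency map contains a cycle, which would be a signal that the plan itself needs revising before any code is generated.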

Authors:Radha Lahoti, M. Khalid Jawed
Title: MAT-DiSMech: A Discrete Differential Geometry-based Computational Tool for Simulation of Rods, Shells, and Soft Robots
Abstract:
Accurate and efficient simulation tools are essential in robotics, enabling the visualization of system dynamics and the validation of control laws before committing resources to physical experimentation. Developing physically accurate simulation tools is particularly challenging in soft robotics, largely due to the prevalence of geometrically nonlinear deformation. A variety of robot simulators tackle this challenge by using simplified modeling techniques, such as lumped mass models, which lead to physical inaccuracies in real-world applications. On the other hand, high-fidelity simulation methods for soft structures, like finite element analysis, offer increased accuracy but lead to higher computational costs. In light of this, we present a Discrete Differential Geometry-based simulator that provides a balance between physical accuracy and computational speed. Building on an extensive body of research on rod and shell-based representations of soft robots, our tool provides a pathway to accurately model soft robots in a computationally tractable manner. Our open-source MATLAB-based framework is capable of simulating the deformations of rods, shells, and their combinations, primarily utilizing implicit integration techniques. The software design is modular so that users can customize the code, for example by adding new external forces or imposing custom boundary conditions. Implementations of prevalent forces encountered in robotics, including gravity, contact, kinetic and viscous friction, and aerodynamic drag, are provided. We provide several illustrative examples that showcase the capabilities and validate the physical accuracy of the simulator. The open-source code is available at https://github.com/StructuresComp/dismech-matlab.git. We anticipate that the proposed simulator can serve as an effective digital twin tool, enhancing the Sim2Real pathway in soft robotics research.
中文: 本文提出了一种基于离散微分几何的模拟器,在软体机器人仿真中平衡了物理精度与计算效率,提供了一个开源的MATLAB框架,能够模拟杆件、壳体及多种力,以强化仿真到现实的转化路径。
English: This paper introduces a Discrete Differential Geometry-based simulator that balances physical accuracy and computational efficiency for soft robotics, offering an open-source MATLAB framework capable of modeling rods, shells, and various forces to enhance Sim2Real applications.
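
To illustrate why implicit integration is attractive for stiff elastic systems, the following sketch applies backward (implicit) Euler to a single damped spring. It is a toy one-degree-of-freedom example in Python, not code from the MATLAB framework; the parameter values and function names are assumptions. For this linear system the implicit update can be solved in closed form.

```python
def implicit_euler_step(x, v, dt, m=1.0, k=100.0, c=0.5):
    """One backward-Euler step for m*x'' = -k*x - c*x'.

    Solves v_next = v + dt * (-k * x_next - c * v_next) / m together with
    x_next = x + dt * v_next, which is linear in v_next.
    """
    v_next = (v - dt * k * x / m) / (1.0 + dt * c / m + dt * dt * k / m)
    x_next = x + dt * v_next
    return x_next, v_next


def simulate(x0, v0, dt, steps, **params):
    """Integrate for `steps` steps and return the list of positions."""
    x, v = x0, v0
    trajectory = [x]
    for _ in range(steps):
        x, v = implicit_euler_step(x, v, dt, **params)
        trajectory.append(x)
    return trajectory
```

With the default stiffness the natural frequency is 10 rad/s, so a step of `dt=0.5` is well beyond what an explicit Euler scheme would tolerate, yet the implicit update stays bounded. The price is numerical damping, and for nonlinear rods and shells each step would require solving a nonlinear system rather than applying a closed-form formula.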

Authors:Kai Cui, Jia Li, Yu Liu, Xuesong Zhang, Zhenzhen Hu, Meng Wang
Title: PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
Abstract:
Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
中文: PhysioSync是一种新型预训练框架,通过建模脑电与外周生理信号的跨模态动态同步关系,并在多时间分辨率下捕捉情绪波动,显著提升了基于脑电的情绪识别性能。
English: PhysioSync is a novel pre-training framework that enhances EEG-based emotion recognition by modeling dynamic cross-modal synchronization with peripheral physiological signals and capturing emotional fluctuations across multiple temporal resolutions.
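
A cross-modal contrastive objective of the kind the abstract describes can be sketched with an InfoNCE-style loss. The use of InfoNCE, cosine similarity, and the `temperature` value are assumptions made for illustration; the abstract does not specify the exact loss used by CM-CA or LS-TCL. The sketch treats the EEG and PPS embeddings of the same time window as a positive pair and all other pairings in the batch as negatives.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def info_nce(eeg_embs, pps_embs, temperature=0.1):
    """Mean InfoNCE loss; (eeg_embs[i], pps_embs[i]) are the positive pairs."""
    eeg = [_normalize(v) for v in eeg_embs]
    pps = [_normalize(v) for v in pps_embs]
    loss = 0.0
    for i, e in enumerate(eeg):
        # Cosine similarity of this EEG embedding to every PPS embedding.
        logits = [sum(a * b for a, b in zip(e, p)) / temperature for p in pps]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]          # negative log-softmax of the positive
    return loss / len(eeg)
```

The loss is small when each EEG embedding is closest to its own PPS partner and grows when the pairing is scrambled, which is the behaviour a synchronization-oriented pre-training signal needs.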

Authors:Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Title: OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection
Abstract:
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at https://github.com/AlbertoFdezHdez/OUI.
中文: 过拟合-欠拟合指示器(OUI)是一种创新工具,无需验证数据即可通过监测训练动态有效识别最佳权重衰减超参数,从而及早发现过拟合或欠拟合问题以提升模型泛化能力。
English: The Overfitting-Underfitting Indicator (OUI) is a novel tool that efficiently identifies optimal Weight Decay hyperparameters by monitoring training dynamics without validation data, enabling early detection of overfitting or underfitting for improved model generalization.

Authors:Chanhee Park, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Title: MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
Abstract:
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.
中文: MIRAGE 是一个专为检索增强生成系统评估设计的问答数据集,包含精心筛选的实例和新型评估指标,能够全面衡量不同检索器与大语言模型组合的适应性与动态特性。
English: MIRAGE is a specialized dataset for evaluating Retrieval-Augmented Generation systems, featuring curated question-answer pairs and novel metrics to assess adaptability across different retriever-LLM configurations.

Authors:Ning Li, Antai Andy Liu, Jingran Zhang, Justin Cui
Title: Latent Video Dataset Distillation
Abstract:
Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase. Our code is available at https://github.com/liningresearch/Latent_Video_Dataset_Distillation.
中文: 本研究提出了一种新颖的视频数据集蒸馏方法,在潜在空间中使用变分编码器并结合多样性感知样本选择,实现了最先进的性能,相比现有方法有显著提升。
English: This study introduces a novel video dataset distillation method that operates in the latent space using a variational encoder and incorporates diversity-aware sample selection, achieving state-of-the-art performance with significant improvements over prior methods.

Authors:Hannah Cyberey, David Evans
Title: Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Abstract:
Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying the negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering
中文: 本研究运用表征工程技术识别并操控安全调优语言模型中的审查向量,从而控制模型的拒绝行为,并揭示可通过逆向应用向量消除审查的思维抑制机制。
English: This study uses representation engineering to identify and manipulate censorship vectors in safety-tuned language models, enabling control over refusal behaviors and revealing thought suppression mechanisms that can be reversed to eliminate censorship.
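
The general mechanics of detecting and steering along a direction in activation space can be sketched with toy vectors. Extracting the direction as a difference of mean activations is a common representation-engineering choice assumed here for illustration; the abstract does not say that this is how the paper's refusal-compliance vector is computed, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(refusal_acts, compliance_acts):
    """Unit vector from mean compliance activation to mean refusal activation."""
    mu_r = mean_vector(refusal_acts)
    mu_c = mean_vector(compliance_acts)
    diff = [r - c for r, c in zip(mu_r, mu_c)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def projection(hidden, direction):
    """Scalar score along the direction; larger means closer to refusal."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state; negative alpha moves away."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

The same vector thus serves two purposes: `projection` gives a detection score, and `steer` with a positive or negative multiple shifts a hidden state toward or away from the refusal side.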

Authors:Kartikay Tehlan, Thomas Wendler
Title: Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET
Abstract:
Dynamic positron emission tomography (PET) with [$^{18}$F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [$^{18}$F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
中文: 本研究提出了一种基于隐式神经表示的生理神经表示方法,用于动态PET成像中的个性化动力学参数估计,结合解剖先验知识,相比传统方法实现了更高分辨率、更低误差和更好的解剖一致性。
English: This study introduces a physiological neural representation using implicit neural representations (INRs) for personalized kinetic parameter estimation in dynamic PET imaging, integrating anatomical priors to achieve higher resolution, lower error, and improved anatomical consistency compared to conventional methods.
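
The two-tissue compartment model that the kinetic parameters refer to can be written as two coupled ordinary differential equations. The sketch below integrates them with forward Euler for a given plasma input function; the rate constants in any call are hypothetical, the blood volume fraction is ignored, and the neural-representation part of the method is not shown.

```python
def simulate_two_tissue(cp, dt, K1, k2, k3, k4=0.0):
    """Forward-Euler integration of the two-tissue compartment model.

    cp: plasma input function sampled every `dt`.
    dC1/dt = K1*Cp - (k2 + k3)*C1 + k4*C2   (free tracer in tissue)
    dC2/dt = k3*C1 - k4*C2                  (bound/metabolised tracer)
    Returns the total tissue concentration C1 + C2 at each sample.
    """
    c1 = c2 = 0.0
    tissue = []
    for cp_t in cp:
        tissue.append(c1 + c2)
        dc1 = K1 * cp_t - (k2 + k3) * c1 + k4 * c2
        dc2 = k3 * c1 - k4 * c2
        c1 += dt * dc1
        c2 += dt * dc2
    return tissue
```

Voxel-wise estimation amounts to finding the `K1`, `k2`, `k3`, `k4` values for which this simulated curve best matches the measured time-activity curve of each voxel, which is the fitting problem that becomes expensive at full image resolution.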

Authors:Valentin Langer, Kartikay Tehlan, Thomas Wendler
Title: Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation
Abstract:
Accurate kinetic analysis of [$^{18}$F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [$^{18}$F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of $13.39\%$ for the liver and $10.42\%$ for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.
中文: 本研究提出一种多器官分割方法,通过整合来自多个血管的图像衍生输入函数来改进动态PET成像中的动力学建模,在肝脏和肺部分析中实现了显著的误差降低。
English: This study introduces a multi-organ segmentation method that integrates image-derived input functions from multiple vessels to enhance kinetic modeling in dynamic PET imaging, reducing mean squared error by 13.39% for the liver and 10.42% for the lungs on data from nine patients.

Authors:Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz
Title: Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators
Abstract:
In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.
中文: 本文提出了SE(3)流形上的几何统一力阻抗控制框架,通过能量储罐增强确保被动性和力跟踪能力,并利用速度与力场解决了非因果实现问题。
English: This paper introduces a geometric unified force-impedance control (GUFIC) framework on the SE(3) manifold that ensures passivity and force tracking through energy tank augmentation and resolves non-causal implementation issues with velocity and force fields.

Authors:Dongjin Seo, Soobin Um, Sangbin Lee, Jong Chul Ye, Haejun Chung
Title: Physics-guided and fabrication-aware inverse design of photonic devices using diffusion models
Abstract:
Designing free-form photonic devices is fundamentally challenging due to the vast number of possible geometries and the complex requirements of fabrication constraints. Traditional inverse-design approaches--whether driven by human intuition, global optimization, or adjoint-based gradient methods--often involve intricate binarization and filtering steps, while recent deep learning strategies demand prohibitively large numbers of simulations (10^5 to 10^6). To overcome these limitations, we present AdjointDiffusion, a physics-guided framework that integrates adjoint sensitivity gradients into the sampling process of diffusion models. AdjointDiffusion begins by training a diffusion network on a synthetic, fabrication-aware dataset of binary masks. During inference, we compute the adjoint gradient of a candidate structure and inject this physics-based guidance at each denoising step, steering the generative process toward high figure-of-merit (FoM) solutions without additional post-processing. We demonstrate our method on two canonical photonic design problems--a bent waveguide and a CMOS image sensor color router--and show that our method consistently outperforms state-of-the-art nonlinear optimizers (such as MMA and SLSQP) in both efficiency and manufacturability, while using orders of magnitude fewer simulations (approximately 2 x 10^2) than pure deep learning approaches (approximately 10^5 to 10^6). By eliminating complex binarization schedules and minimizing simulation overhead, AdjointDiffusion offers a streamlined, simulation-efficient, and fabrication-aware pipeline for next-generation photonic device design. Our open-source implementation is available at https://github.com/dongjin-seo2020/AdjointDiffusion.
中文: AdjointDiffusion是一种将伴随梯度融入扩散模型的物理引导框架,能以远超传统方法的仿真效率实现可制造的光子器件设计。
English: AdjointDiffusion is a physics-guided framework that integrates adjoint gradients into diffusion models, enabling efficient and fabrication-aware photonic device design with significantly fewer simulations than traditional methods.

Authors:Xinqi Xiong, Andrea Dunn Beltran, Jun Myeong Choi, Marc Niethammer, Roni Sengupta
Title: PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
Abstract:
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.
中文: 本研究提出一种新颖的图像转换框架,通过将稳定扩散模型与ControlNet结合,并利用逐像素着色图作为结构约束,能够生成更真实的内窥镜图像并显著提升深度估计精度。
English: This study introduces an innovative image-to-image translation framework that combines Stable Diffusion with ControlNet, using Per-Pixel Shading maps as structural constraints to generate realistic endoscopic images and improve depth estimation performance.

Authors:Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Title: A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Abstract:
Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: https://github.com/emrecanacikgoz/awesome-conversational-agents.
中文: 本综述将基于大语言模型的对话智能体划分为推理、监控与控制三大维度,通过构建新分类法指出研究空白与未来方向,以推动实现类人智能与通用人工智能的进展。
English: This survey organizes LLM-driven conversational agents into reasoning, monitoring, and control dimensions, proposing a taxonomy to address research gaps and future directions toward achieving human-like intelligence and AGI.

Authors:David Yan, Alexander Raistrick, Jia Deng
Title: Procedural Dataset Generation for Zero-Shot Stereo Matching
Abstract:
Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.
中文: 本研究推出了专为零样本立体匹配优化的程序化生成数据集Infinigen-Stereo,仅使用该数据训练的模型超越了现有合成数据集组合的效果,并显著提升了零样本立体匹配性能。
English: The study introduces Infinigen-Stereo, a procedural dataset generator optimized for zero-shot stereo matching; models trained only on its data outperform baselines trained on a combination of existing synthetic datasets.

Authors:Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
Title: Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Abstract:
Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
中文: 本研究提出广义邻域注意力(GNA)解决稀疏注意力效率问题,通过优化Blackwell架构实现,在生成模型中最高可获得46%的端到端加速效果。
English: The study introduces Generalized Neighborhood Attention (GNA) to address sparse attention inefficiencies, achieving 28% to 46% end-to-end speedup on off-the-shelf generative models with an implementation optimized for the NVIDIA Blackwell architecture.
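
To make the locality idea concrete, the sketch below builds a 1-D sliding-window (neighborhood) attention mask and derives a naive speedup bound from its density. It is an illustration only: the function name is hypothetical, the boundary handling (shifting the window inward so every query keeps the same number of keys) is an assumption, and the bound ignores the hardware effects that the paper's simulator is said to model.

```python
import numpy as np

def neighborhood_mask(n, window):
    """Boolean (n, n) mask: entry [i, j] is True if query i attends to key j.

    Each query attends to exactly `window` keys. The window is centred on the
    query where possible and shifted inward at the sequence boundaries.
    """
    if not 1 <= window <= n:
        raise ValueError("window must satisfy 1 <= window <= n")
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        start = min(max(i - window // 2, 0), n - window)
        mask[i, start:start + window] = True
    return mask

mask = neighborhood_mask(n=8, window=3)
density = mask.mean()          # fraction of query-key pairs that are computed
ideal_speedup = 1.0 / density  # naive bound if cost scaled with kept pairs only
```

Real kernels work on tiles rather than single query-key pairs, so a mask that is not block-aligned realizes less than this naive bound; that gap is what motivates a more realistic performance model.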

Authors:Hanwen Du, Bo Peng, Xia Ning
Title: Planning with Diffusion Models for Target-Oriented Dialogue Systems
Abstract:
Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance toward diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through https://github.com/ninglab/DiffTOD.
中文:DiffTOD提出了一种创新的对话规划框架,利用扩散模型实现非顺序规划,通过针对不同目标类型的定制引导机制,有效解决了目标导向对话中的复合错误和短视行为,并在多样场景中展现出强大灵活性。
English: DiffTOD introduces a novel dialogue planning framework using diffusion models to enable non-sequential planning, addressing compounding errors and myopic actions in Target-Oriented Dialogue by optimizing strategies through tailored guidance mechanisms across diverse scenarios.

Authors:Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Title: Process Reward Models That Think
Abstract:
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
中文: ThinkPRM是一种生成式长思维链验证器,仅需1%的过程标注即可在多个基准测试中超越现有方法,以极少的监督实现高效验证计算扩展。
English: ThinkPRM is a generative, long chain-of-thought verifier that achieves superior performance across multiple benchmarks using only 1% of process labels, effectively scaling test-time verification with minimal supervision.
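
Best-of-N selection with a step-level verifier can be sketched as follows. This is a minimal illustration, not ThinkPRM itself: the per-step scores are hypothetical placeholders for whatever a process reward model would output, and aggregating by the minimum step score is one common choice assumed here rather than taken from the abstract.

```python
import math

def solution_score(step_scores):
    """Aggregate per-step correctness scores into one solution-level score.

    Uses the minimum step score, so a single weak step sinks the solution.
    """
    if not step_scores:
        return -math.inf
    return min(step_scores)

def best_of_n(candidates):
    """Return the (answer, step_scores) pair with the highest solution score."""
    return max(candidates, key=lambda c: solution_score(c[1]))

# Hypothetical candidates: (final answer, verifier score for each step).
candidates = [
    ("x = 4", [0.9, 0.2, 0.95]),
    ("x = 5", [0.8, 0.7, 0.75]),
    ("x = 6", [0.99, 0.6, 0.4]),
]
answer, scores = best_of_n(candidates)
```

Here the second candidate wins because its weakest step (0.7) is stronger than the weakest step of the other two, even though those contain individual steps with higher scores.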

Authors:Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
Title: Decoupled Global-Local Alignment for Improving Compositional Understanding
Abstract:
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This self-distillation constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
中文: DeGLA框架通过自蒸馏机制和新型对比损失,在提升CLIP组合理解能力的同时有效保留其通用性能,在多项基准测试中实现了显著性能提升。
English: The DeGLA framework enhances CLIP's compositional understanding through self-distillation and novel contrastive losses while preserving its general capabilities, achieving significant performance gains on multiple benchmarks.
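
The teacher "derived from an exponential moving average" can be illustrated with a generic EMA weight update. The sketch assumes weights stored as a dictionary of NumPy arrays and a hypothetical momentum value; it does not reproduce DeGLA's training code or its distillation loss.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an exponential moving average.

    teacher, student : dicts mapping parameter names to arrays of equal shape
    momentum         : fraction of the old teacher kept at each update
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy weights (hypothetical), with a small momentum so the effect is visible.
teacher = {"w": np.zeros(2)}
student = {"w": np.ones(2)}
teacher = ema_update(teacher, student, momentum=0.9)
```

Because the teacher changes slowly and receives no gradients, aligning the student with it acts as a pull toward the pretrained behaviour during fine-tuning.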

Authors:Jialiang Zhang, Feng Gao, Yanhai Gan, Junyu Dong, Qian Du
Title: Frequency-Compensated Network for Daily Arctic Sea Ice Concentration Prediction
Abstract:
Accurately forecasting sea ice concentration (SIC) in the Arctic is critical to global ecosystem health and navigation safety. However, current methods still face two challenges: 1) they rarely explore long-term feature dependencies in the frequency domain; 2) they can hardly preserve high-frequency details, so changes in the marginal area of the sea ice cannot be accurately captured. To this end, we present a Frequency-Compensated Network (FCNet) for Arctic SIC prediction on a daily basis. In particular, we design a dual-branch network, including branches for frequency feature extraction and convolutional feature extraction. For frequency feature extraction, we design an adaptive frequency filter block, which integrates trainable layers with Fourier-based filters. By adding frequency features, the FCNet can achieve refined prediction of edges and details. For convolutional feature extraction, we propose a high-frequency enhancement block to separate high and low-frequency information. Moreover, high-frequency features are enhanced via channel-wise attention, and a temporal attention unit is employed for low-frequency feature extraction to capture long-range sea ice changes. Extensive experiments are conducted on a satellite-derived daily SIC dataset, and the results verify the effectiveness of the proposed FCNet. Our code and data will be made publicly available at: https://github.com/oucailab/FCNet.
中文: 频率补偿网络(FCNet)通过结合频域分析和卷积特征,提升了北极海冰密集度的预测精度,有效捕捉长期依赖关系并保留边缘区域的高频细节。
English: The Frequency-Compensated Network (FCNet) enhances Arctic sea ice concentration forecasting by integrating frequency-domain analysis and convolutional features to capture long-term dependencies and preserve high-frequency details in marginal ice zones.

Authors:Aniketh Garikaparthi, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Title: IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Abstract:
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. These features are designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
中文: 本文提出开源平台IRIS,通过自适应计算和文献整合等功能将人类反馈与大型语言模型相结合,有效提升科学假说生成能力,并经过跨学科用户研究验证。
English: This paper introduces IRIS, an open-source platform that enhances scientific hypothesis generation by integrating human feedback with LLMs through features like adaptive computation and literature synthesis, validated by a cross-disciplinary user study.

Authors:Xinru Meng, Han Sun, Jiamei Liu, Ningzhong Liu, Huiyu Zhou
Title: Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation
Abstract:
Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, where it consistently outperforms state-of-the-art methods.
中文:提出的基于能量的伪标签细化方法通过能量阈值筛选伪标签并采用对比学习增强特征区分度,有效解决了无源域适应中的噪声问题,在多个基准数据集上展现出优越性能。
English: The proposed Energy-Based Pseudo-Label Refining (EBPR) method addresses noise in source-free domain adaptation by filtering pseudo-labels using energy thresholds and enhancing feature discrimination through contrastive learning, demonstrating superior performance on benchmark datasets.
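
Filtering pseudo-labels with global and per-class energy thresholds can be sketched as below. The abstract does not give the exact formulation, so this is an assumption-laden illustration: it uses the common free-energy score E(x) = -T * logsumexp(logits / T), takes argmax predictions as pseudo-labels instead of the paper's cluster-based labels, and uses mean energies as the thresholds. Function names and logits are hypothetical.

```python
import numpy as np

def energy_scores(logits, temperature=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T); lower means more confident."""
    z = logits / temperature
    z_max = z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    lse = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    return -temperature * lse

def filter_pseudo_labels(logits):
    """Keep samples whose energy is at or below both the global mean energy
    and the mean energy of their predicted class (assumed thresholds)."""
    energy = energy_scores(logits)
    labels = logits.argmax(axis=1)
    keep = energy <= energy.mean()
    for c in np.unique(labels):
        in_class = labels == c
        keep &= ~in_class | (energy <= energy[in_class].mean())
    return labels, keep

# Hypothetical logits for four target samples and three classes.
logits = np.array([
    [8.0, 0.0, 0.0],
    [1.0, 0.5, 0.0],
    [0.0, 6.0, 0.0],
    [0.2, 0.1, 0.0],
])
labels, keep = filter_pseudo_labels(logits)
```

In this toy batch only the two sharply peaked predictions survive both thresholds; the two near-uniform ones are dropped as likely label noise.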

Authors:Gerardus Croonen, Andreas Trondl, Julia Simon, Daniel Steininger
Title: SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets
Abstract:
While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.
中文: 本研究提出了一种新颖的标注数据集和两阶段方法,用于在单目RGB图像中检测、语义分割和估算收获后储存的甜菜质量,以应对因微生物和多余植被导致的储存期间糖分损失问题。
English: This study introduces a new annotated dataset and a two-stage method for detecting, segmenting, and estimating the mass of sugar beets in images to address sugar loss during storage caused by microorganisms and vegetation.

Authors:Ceren Yildirim, Kamer Kaya, Sinan Yildirim, Erkay Savas
Title: MCMC for Bayesian estimation of Differential Privacy from Membership Inference Attacks
Abstract:
We propose a new framework for Bayesian estimation of differential privacy, incorporating evidence from multiple membership inference attacks (MIA). Bayesian estimation is carried out via a Markov chain Monte Carlo (MCMC) algorithm, named MCMC-DP-Est, which provides an estimate of the full posterior distribution of the privacy parameter (e.g., instead of just credible intervals). Critically, the proposed method does not assume that privacy auditing is performed with the most powerful attack on the worst-case (dataset, challenge point) pair, which is typically unrealistic. Instead, MCMC-DP-Est jointly estimates the strengths of MIAs used and the privacy of the training algorithm, yielding a more cautious privacy analysis. We also present an economical way to generate measurements for the performance of an MIA that is to be used by the MCMC method to estimate privacy. We present the use of the methods with numerical examples with both artificial and real data.
中文: 作者提出了一种贝叶斯框架,通过结合多种成员推理攻击的证据来估计差分隐私,采用名为MCMC-DP-Est的马尔可夫链蒙特卡罗算法,在不依赖不切实际的最坏情况假设下,获得隐私参数的完整后验分布。
English: The authors introduce a Bayesian framework for estimating differential privacy using multiple membership inference attacks, employing an MCMC algorithm called MCMC-DP-Est to derive the full posterior distribution of privacy parameters without relying on unrealistic worst-case assumptions.

Authors:Wenping Ma, Boyou Xue, Mengru Ma, Chuang Chen, Hekai Zhang, Hao Zhu
Title: A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification
Abstract:
Multispectral (MS) and panchromatic (PAN) images describe the same land surface, so these images not only have their own advantages but also share a lot of similar information. To separate this shared information from their respective advantages and reduce feature redundancy in the fusion stage, this paper introduces a diff-attention aware state space fusion model (DAS2F-Model) for multimodal remote sensing image classification. Based on the selective state space model, a cross-modal diff-attention module (CMDA-Module) is designed to extract and separate the common features and the respective dominant features of MS and PAN images. Within this module, space preserving visual mamba (SPVM) retains image spatial features and captures local features by optimizing visual mamba's input reasonably. Considering that features in the fusion stage will have large semantic differences after feature separation and that simple fusion operations struggle to effectively integrate these significantly different features, an attention-aware linear fusion module (AALF-Module) is proposed. It performs pixel-wise linear fusion by calculating influence coefficients. This mechanism can fuse features with large semantic differences while keeping the feature size unchanged. Empirical evaluations indicate that the presented method achieves better results than alternative approaches. The relevant code can be found at: https://github.com/AVKSKVL/DAS-F-Model
中文: 本文提出了一种DAS2F模型,通过跨模态注意力模块分离多光谱与全色图像的共有及独有特征,并采用注意力感知的线性融合方法进行有效整合,从而提升遥感图像分类性能。
English: This paper introduces a DAS2F-Model that separates common and distinct features of multispectral and panchromatic images using a cross-modal attention module and fuses them with an attention-aware linear method to improve remote sensing classification.

Authors:William Corrias, Fabio De Gaspari, Dorjan Hitaj, Luigi V. Mancini
Title: MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark
Abstract:
Recent advances in generative models have led to their application in password guessing, with the aim of replicating the complexity, structure, and patterns of human-created passwords. Despite their potential, inconsistencies and inadequate evaluation methodologies in prior research have hindered meaningful comparisons and a comprehensive, unbiased understanding of their capabilities. This paper introduces MAYA, a unified, customizable, plug-and-play benchmarking framework designed to facilitate the systematic characterization and benchmarking of generative password-guessing models in the context of trawling attacks. Using MAYA, we conduct a comprehensive assessment of six state-of-the-art approaches, which we re-implemented and adapted to ensure standardization. Our evaluation spans eight real-world password datasets and covers an exhaustive set of advanced testing scenarios, totaling over 15,000 compute hours. Our findings indicate that these models effectively capture different aspects of human password distribution and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. Through our evaluation, sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, the diverse password distributions learned by the models enable a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark generative password-guessing models. Our framework is publicly available at https://github.com/williamcorrias/MAYA-Password-Benchmarking.
中文: 本文提出了MAYA这一统一基准测试框架,用于评估生成式密码猜测模型,通过广泛测试发现序列模型优于其他方法且多模型攻击效果最佳,该框架已公开以推动标准化研究。
English: This paper introduces MAYA, a unified benchmarking framework for evaluating generative password-guessing models, revealing through extensive testing that sequential models outperform other methods and that multi-model attacks are most effective, with the framework publicly released to standardize future research.

Authors:Antonios Tragoudaras, Theofanis Aslanidis, Emmanouil Georgios Lionis, Marina Orozco González, Panagiotis Eustratiadis
Title: Information Leakage of Sentence Embeddings via Generative Embedding Inversion Attacks
Abstract:
Text data are often encoded as dense vectors, known as embeddings, which capture semantic, syntactic, contextual, and domain-specific information. These embeddings, widely adopted in various applications, inherently contain rich information that may be susceptible to leakage under certain attacks. The GEIA framework highlights vulnerabilities in sentence embeddings, demonstrating that they can reveal the original sentences they represent. In this study, we reproduce GEIA's findings across various neural sentence embedding models. Additionally, we contribute new analysis to examine whether these models leak sensitive information from their training datasets. We propose a simple yet effective method without any modification to the attacker's architecture proposed in GEIA. The key idea is to examine differences between log-likelihood for masked and original variants of data that sentence embedding models have been pre-trained on, calculated on the embedding space of the attacker. Our findings indicate that, following our approach, an adversary can recover meaningful sensitive information related to the pre-training knowledge of the popular models used for creating sentence embeddings, seriously undermining their security. Our code is available at: https://github.com/taslanidis/GEIA
中文: 该研究复现了GEIA框架关于句子嵌入漏洞的发现,并提出通过分析嵌入空间中掩码与原数据对数似然差异的新方法,成功揭露流行模型训练数据中的敏感信息,凸显其严重安全隐患。
English: The study reproduces the GEIA framework's findings on sentence embedding vulnerabilities and introduces a novel method to expose sensitive training data by analyzing log-likelihood differences in embedding spaces, revealing critical security risks in popular models.

Authors:Xu Guo, Tong Zhang, Fuyun Wang, Xudong Wang, Xiaoya Zhang, Xin Liu, Zhen Cui
Title: MMHCL: Multi-Modal Hypergraph Contrastive Learning for Recommendation
Abstract:
The burgeoning presence of multimodal content-sharing platforms propels the development of personalized recommender systems. Previous works usually suffer from data sparsity and cold-start problems, and may fail to adequately explore semantic user-product associations from multimodal data. To address these issues, we propose a novel Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework for user recommendation. For a comprehensive information exploration from user-product relations, we construct two hypergraphs, i.e. a user-to-user (u2u) hypergraph and an item-to-item (i2i) hypergraph, to mine shared preferences among users and intricate multimodal semantic resemblance among items, respectively. This process yields denser second-order semantics that are fused with first-order user-item interaction as complementary to alleviate the data sparsity issue. Then, we design a contrastive feature enhancement paradigm by applying synergistic contrastive learning. By maximizing/minimizing the mutual information between second-order (e.g. shared preference pattern for users) and first-order (information of selected items for users) embeddings of the same/different users and items, the feature distinguishability can be effectively enhanced. Compared with using sparse primary user-item interaction only, our MMHCL obtains denser second-order hypergraphs and excavates more abundant shared attributes to explore the user-product associations, which to a certain extent alleviates the problems of data sparsity and cold-start. Extensive experiments have comprehensively demonstrated the effectiveness of our method. Our code is publicly available at: https://github.com/Xu107/MMHCL.
中文:提出的多模态超图对比学习(MMHCL)框架通过构建用户和物品超图挖掘共享偏好与语义关联,并采用对比学习增强特征区分度,有效缓解了推荐系统中数据稀疏和冷启动问题。
English: The proposed Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework addresses data sparsity and cold-start issues in personalized recommendation by constructing user and item hypergraphs to mine shared preferences and semantic similarities, then applying contrastive learning to enhance feature distinguishability.

Authors:Junjie Chen, Haitao Li, Jingli Yang, Yiqun Liu, Qingyao Ai
Title: Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution
Abstract:
Intelligent agent systems based on Large Language Models (LLMs) have shown great potential in real-world applications. However, existing agent frameworks still face critical limitations in task planning and execution, restricting their effectiveness and generalizability. Specifically, current planning methods often lack clear global goals, leading agents to get stuck in local branches, or produce non-executable plans. Meanwhile, existing execution mechanisms struggle to balance complexity and stability, and their limited action space restricts their ability to handle diverse real-world tasks. To address these limitations, we propose GoalAct, a novel agent framework that introduces a continuously updated global planning mechanism and integrates a hierarchical execution strategy. GoalAct decomposes task execution into high-level skills, including searching, coding, writing and more, thereby reducing planning complexity while enhancing the agents' adaptability across diverse task scenarios. We evaluate GoalAct on LegalAgentBench, a benchmark with multiple types of legal tasks that require the use of multiple types of tools. Experimental results demonstrate that GoalAct achieves state-of-the-art (SOTA) performance, with an average improvement of 12.22% in success rate. These findings highlight GoalAct's potential to drive the development of more advanced intelligent agent systems, making them more effective across complex real-world applications. Our code can be found at https://github.com/cjj826/GoalAct.
Chinese: GoalAct框架通过引入持续更新的全局规划机制和分层执行策略,解决了现有智能代理在任务规划与执行中的局限性,在法律任务基准测试中以12.22%的成功率提升达到最优性能。
English: The GoalAct framework addresses limitations in LLM-based agents by introducing a global planning mechanism and hierarchical execution strategy, achieving state-of-the-art performance with a 12.22% success rate improvement on legal tasks.

Authors:Wei Zhou, Xiong Xu, Changzheng Wei, Ying Yan, Wei Tang, Zhihao Chen, Xuebing Huang, Wengang Chen, Jie Zhang, Yang Chen, Xiaofu Zheng, Hanghang Wu, Shenglong Chen, Ermei Wang, Xiangfei Chen, Yang Yu, Meng Wu, Tao Zhu, Liwei Yuan, Feng Yu, Alex Zhang, Wei Wang, Ji Luo, Zhengyu He, Wenbiao Zhao
Title: DTVM: Revolutionizing Smart Contract Execution with Determinism and Compatibility
Abstract:
We introduce the DeTerministic Virtual Machine (DTVM) Stack, a next-generation smart contract execution framework designed to address critical performance, determinism, and ecosystem compatibility challenges in blockchain networks. Building upon WebAssembly (Wasm) while maintaining full Ethereum Virtual Machine (EVM) ABI compatibility, DTVM introduces a Deterministic Middle Intermediate Representation (dMIR) and a hybrid lazy-JIT compilation engine to balance compilation speed and execution efficiency. DTVM further accommodates diverse instruction set architectures (e.g., EVM, RISC-V) through modular adaptation layers. This enables seamless integration with DTVM's hybrid lazy-JIT compilation engine, which dynamically optimizes performance while preserving deterministic execution guarantees across heterogeneous environments. The key contributions include: 1) The framework achieves up to 2$\times$ acceleration over evmone in dominant Ethereum contract (e.g. ERC20/721/1155) execution and reduces fibonacci computation latency by 11.8$\sim$40.5% compared to Wasm-based VMs. 2) A novel trampoline hot-switch mechanism enables sub-millisecond (0.95ms) post-deployment invocation times, an improvement of up to about 23$\times$ in compilation and invocation efficiency. 3) It supports multi-language development (Solidity, C++, Rust, Java, Go, and AssemblyScript) through unified bytecode conversion while maintaining EVM ABI compatibility for seamless invocation. It reduces machine code object sizes by 30.0$\sim$72.6%, coupled with a minimized Trusted Computing Base. 4) It offers SmartCogent, an AI-driven full-stack development experience, leveraging fine-tuned LLMs and retrieval-augmented generation to automate tasks across the smart contract lifecycle: development, debugging, security auditing, and deployment. DTVM Stack has been open-sourced (https://github.com/DTVMStack).
中文: DTVM Stack 是新一代智能合约执行框架,通过混合惰性即时编译引擎和模块化架构提升性能与确定性,在以太坊合约执行中实现显著加速,同时支持多语言开发和AI驱动的全栈开发体验。
English: The DTVM Stack is a next-generation smart contract execution framework that enhances performance and determinism through a hybrid lazy-JIT compilation engine and modular architecture, achieving significant speedups in Ethereum contract execution while supporting multi-language development and AI-driven tools.

Authors:Seungyoon Choi, Sein Kim, Hongseok Kang, Wonjoong Kim, Chanyoung Park
Title: Dynamic Time-aware Continual User Representation Learning
Abstract:
Traditional user modeling (UM) approaches have primarily focused on designing models for a single specific task, but they face limitations in generalization and adaptability across various tasks. Recognizing these challenges, recent studies have shifted towards continual learning (CL)-based universal user representation learning aiming to develop a single model capable of handling multiple tasks. Despite advancements, existing methods are in fact evaluated under an unrealistic scenario that does not consider the passage of time as tasks progress, which overlooks newly emerged items that may change the item distribution of previous tasks. In this paper, we introduce a practical evaluation scenario on which CL-based universal user representation learning approaches should be evaluated, which takes into account the passage of time as tasks progress. Then, we propose a novel framework Dynamic Time-aware continual user representation learner, named DITTO, designed to alleviate catastrophic forgetting despite continuous shifts in item distribution, while also allowing the knowledge acquired from previous tasks to adapt to the current shifted item distribution. Through our extensive experiments, we demonstrate the superiority of DITTO over state-of-the-art methods under a practical evaluation scenario. Our source code is available at https://github.com/seungyoon-Choi/DITTO_official.
Chinese: 本文提出了一种实用的评估场景和名为DITTO的新型持续学习通用用户表示框架,通过动态适应时间引起的物品分布变化并减轻灾难性遗忘,有效解决了现有方法的局限性。
English: This paper introduces a practical evaluation scenario and a novel framework called DITTO for continual learning-based universal user representation, addressing limitations in existing methods by dynamically adapting to time-induced shifts in item distribution while mitigating catastrophic forgetting.

Authors:Yahao Lu, Yuehui Li, Xingyuan Guo, Shuai Yuan, Yukai Shi, Liang Lin
Title: Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning
Abstract:
Infrared small target detection (ISTD) is highly sensitive to sensor type, observation conditions, and the intrinsic properties of the target. These factors can introduce substantial variations in the distribution of acquired infrared image data, a phenomenon known as domain shift. Such distribution discrepancies significantly hinder the generalization capability of ISTD models across diverse scenarios. To tackle this challenge, this paper introduces an ISTD framework enhanced by domain adaptation. To alleviate distribution shift between datasets and achieve cross-sample alignment, we introduce Cross-view Channel Alignment (CCA). Additionally, we propose the Cross-view Top-K Fusion strategy, which integrates target information with diverse background features, enhancing the model's ability to extract critical data characteristics. To further mitigate the impact of noise on ISTD, we develop a Noise-guided Representation learning strategy. This approach enables the model to learn more noise-resistant feature representations, improving its generalization capability across diverse noisy domains. Finally, we develop a dedicated infrared small target dataset, RealScene-ISTD. Compared to state-of-the-art methods, our approach demonstrates superior performance in terms of detection probability (Pd), false alarm rate (Fa), and intersection over union (IoU). The code is available at: https://github.com/luy0222/RealScene-ISTD.
中文摘要:本文提出了一种增强领域适应的红外小目标检测框架,通过跨视图对齐和抗噪表征学习解决领域偏移问题,并在新构建的数据集上实现了更优的检测性能。
English Summary: This paper introduces a domain adaptation-enhanced framework for infrared small target detection that addresses domain shift through cross-view alignment and noise-resistant representation learning, achieving superior performance on a newly developed dataset.

Authors:Shun Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guojun Qi
Title: Cross Paradigm Representation and Alignment Transformer for Image Deraining
Abstract:
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we extract the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validate CPRAformer's robustness in other image restoration tasks and downstream applications.
Chinese: 提出的CPRAformer通过分层对齐和两种自注意力机制整合全局-局部与空间-通道表征,在多个基准测试中实现了最先进的图像去雨性能。
English: The proposed CPRAformer integrates global-local and spatial-channel representations through hierarchical alignment and two specialized self-attention mechanisms, achieving state-of-the-art image deraining performance across multiple benchmarks.

Authors:Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang
Title: From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories
Abstract:
Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit 4 critical gaps: 1) Their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) They fail to incorporate pivotal LLM/Transformer-based defenses; 3) No open-source implementations are collected to facilitate benchmarking; 4) Their dataset coverage is insufficient. This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g. Transformer, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works (2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for the ongoing curation of datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.
中文摘要:本文对恶意URL检测技术进行全面综述,提出基于模态的新分类法,并通过整理开源实现和数据集来解决现有研究的不足,为建立标准化基准提供支持。
English Summary: This paper provides a comprehensive review of malicious URL detection technologies, introducing a novel modality-based taxonomy and addressing critical gaps in existing research by curating open-source implementations and datasets to establish standardized benchmarking.

Authors:Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
Title: POPri: Private Federated Learning using Preference-Optimized Synthetic Data
Abstract:
In practical settings, differentially private federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as an RL (reinforcement learning) reward. Our algorithm, Policy Optimization for Private Data (POPri), harnesses client feedback using policy optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri.
中文: POPri算法通过策略优化提升联邦学习中差分隐私合成数据的质量,在LargeFedBench等基准测试中将非隐私环境下的效用差距缩小高达58%。
English: The POPri algorithm leverages policy optimization to enhance differentially private synthetic data generation in federated learning, significantly narrowing the utility gap with non-private settings by up to 58% on benchmarks like LargeFedBench.
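
The abstract names Direct Preference Optimization as one policy optimization algorithm that can consume client feedback. The following is a minimal sketch of the standard per-pair DPO loss, not the POPri implementation: the helper `preference_pair`, which turns scored samples into a (chosen, rejected) pair, is a hypothetical illustration of treating feedback as a reward, and `beta=0.1` is an assumed value.

```python
import math


def preference_pair(samples, scores):
    """Hypothetical helper: treat feedback scores as rewards and return
    the highest-scored sample as 'chosen' and the lowest as 'rejected'."""
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta is an assumed temperature.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more probability to the chosen sample than the reference model does, which is what pushes generation toward samples that received better feedback.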

Authors:Hariseetharam Gunduboina, Muhammad Haris Khan, Biplab Banerjee
Title: FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing
Abstract:
In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at https://github.com/HariseetharamG/FrogDogNet
Chinese: FrogDogNet提出了一种新颖的提示学习框架,通过傅里叶频率滤波和自注意力机制保留不变的低频特征并消除噪声,显著提升了遥感场景分类和领域泛化能力,在多个数据集和任务中持续优于现有最优方法。
English: FrogDogNet introduces a novel prompt learning framework that leverages Fourier frequency filtering and self-attention to enhance domain generalization in remote sensing by preserving invariant low-frequency features while eliminating noise, consistently outperforming state-of-the-art methods across multiple datasets and tasks.
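
The idea of retaining low-frequency components while discarding high-frequency ones can be illustrated on a toy 1D signal. This is only a sketch of generic Fourier low-pass filtering with a naive DFT; FrogDogNet operates on image features inside a prompt learning pipeline, and the function names and the `cutoff` parameter here are assumptions made for illustration.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), fine for toy inputs."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def low_pass(signal, cutoff):
    """Keep frequency bins whose distance from the zero frequency is at
    most `cutoff`, zero out the rest, and return the real-valued result."""
    n = len(signal)
    spectrum = dft(signal)
    # Bin k and bin n - k carry the same absolute frequency, so both are
    # kept or dropped together to keep the output real.
    kept = [c if min(k, n - k) <= cutoff else 0
            for k, c in enumerate(spectrum)]
    return [value.real for value in idft(kept)]


# A constant level plus a rapidly alternating component.
noisy = [1.0 + 0.5 * (-1) ** t for t in range(8)]
smooth = low_pass(noisy, cutoff=1)
```

In this toy case the alternating component sits in the highest frequency bin, so filtering with `cutoff=1` leaves only the constant level.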

Authors:Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
Title: Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Abstract:
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
中文: 本文提出MMLA基准,专门评估多模态大语言模型在六种语义维度上理解认知层面语义的能力,实验表明即使经过微调的模型准确率也仅达60%-70%,凸显出现有模型在理解复杂人类语言方面的局限。
English: This paper introduces MMLA, a comprehensive benchmark for evaluating multimodal large language models' ability to understand cognitive-level semantics across six dimensions, revealing that even fine-tuned models reach only about 60%-70% accuracy.

Authors:Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di
Title: LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
Abstract:
The LLMSR@XLLM25 shared task formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structured reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/JhCircle/Less-is-More.
中文摘要:在LLMSR@XLLM25竞赛中获得第三名的“Less is More”方法,通过多智能体框架结合逆向提示诱导和奖励引导过滤,仅用24个标注样本就提升了结构化推理能力,证明了可控数据蒸馏在低资源条件下的有效性。
English Summary: The "Less is More" approach, which won third place in the LLMSR@XLLM25 competition, employs a multi-agent framework with reverse-prompt induction and reward-guided filtering to enhance structured reasoning using only 24 labeled examples, demonstrating the effectiveness of controllable data distillation under low-resource constraints.

Authors:Yuanjian Wang, Yufei Deng, Rong Xiao, Jiahao Fan, Chenwei Tang, Deng Xiong, Jiancheng Lv
Title: SaENeRF: Suppressing Artifacts in Event-based Neural Radiance Fields
Abstract:
Event cameras are neuromorphic vision sensors that asynchronously capture changes in logarithmic brightness, offering significant advantages such as low latency, low power consumption, low bandwidth, and high dynamic range. While these characteristics make them ideal for high-speed scenarios, reconstructing geometrically consistent and photometrically accurate 3D representations from event data remains fundamentally challenging. Current event-based Neural Radiance Fields (NeRF) methods partially address these challenges but suffer from persistent artifacts caused by aggressive network learning in early stages and the inherent noise of event cameras. To overcome these limitations, we present SaENeRF, a novel self-supervised framework that effectively suppresses artifacts and enables 3D-consistent, dense, and photorealistic NeRF reconstruction of static scenes solely from event streams. Our approach normalizes predicted radiance variations based on accumulated event polarities, facilitating progressive and rapid learning for scene representation construction. Additionally, we introduce regularization losses specifically designed to suppress artifacts in regions where photometric changes fall below the event threshold and simultaneously enhance the light intensity difference of non-zero events, thereby improving the visual fidelity of the reconstructed scene. Extensive qualitative and quantitative experiments demonstrate that our method significantly reduces artifacts and achieves superior reconstruction quality compared to existing methods. The code is available at https://github.com/Mr-firework/SaENeRF.
中文: 事件相机具有低延迟和高动态范围优势,但存在噪声和早期学习伪影问题;SaENeRF通过自监督的归一化处理和正则化损失,有效抑制伪影并实现高质量神经辐射场重建。
English: Event cameras offer low latency and high dynamic range but struggle with 3D reconstruction due to noise and early learning artifacts, which SaENeRF addresses through self-supervised normalization and regularization for superior, artifact-free NeRF results.

Authors:Fengchun Liu, Tong Zhang, Chunying Zhang
Title: CLPSTNet: A Progressive Multi-Scale Convolutional Steganography Model Integrating Curriculum Learning
Abstract:
In recent years, a large number of works have introduced Convolutional Neural Networks (CNNs) into image steganography, transforming traditional steganography methods based on hand-crafted features and prior knowledge design into methods in which neural networks autonomously learn information embedding. However, due to the inherent complexity of digital images, issues of invisibility and security persist when using CNN models for information embedding. In this paper, we propose the Curriculum Learning Progressive Steganography Network (CLPSTNet). The network consists of multiple progressive multi-scale convolutional modules that integrate Inception structures and dilated convolutions. Each module contains multiple branching pathways, starting from a smaller convolutional kernel and dilation rate to extract basic, local feature information from the feature map, and gradually expanding to convolutions with a larger kernel and dilation rate to perceive feature information over a larger receptive field. This realizes multi-scale feature extraction from shallow to deep and from fine to coarse, allowing the shallow secret information features to be refined in different fusion stages. The experimental results show that the proposed CLPSTNet not only achieves high PSNR, SSIM, and decoding accuracy on three large public datasets (ALASKA2, VOC2012, and ImageNet), but also generates steganographic images with low steganalysis scores. Our code is available at https://github.com/chaos-boops/CLPSTNet.
Chinese: 本文提出CLPSTNet,一种基于课程学习的渐进式隐写网络,通过多尺度卷积模块实现从细到粗的特征提取,在多个数据集上展现出卓越的不可见性、安全性和性能指标。
English: This paper introduces CLPSTNet, a curriculum learning-based progressive steganography network that employs multi-scale convolutional modules to enhance feature extraction from fine to coarse, achieving superior invisibility, security, and performance metrics across multiple datasets.

Authors:Xuming Hu, Hanqian Li, Jungang Li, Yu Huang, Aiwei Liu
Title: VideoMark: A Distortion-Free Robust Watermarking Framework for Video Diffusion Models
Abstract:
This work introduces \textbf{VideoMark}, a distortion-free robust watermarking framework for video diffusion models. As diffusion models excel in generating realistic videos, reliable content attribution is increasingly critical. However, existing video watermarking methods often introduce distortion by altering the initial distribution of diffusion variables and are vulnerable to temporal attacks, such as frame deletion, due to variable video lengths. VideoMark addresses these challenges by employing a \textbf{pure pseudorandom initialization} to embed watermarks, avoiding distortion while ensuring uniform noise distribution in the latent space to preserve generation quality. To enhance robustness, we adopt a frame-wise watermarking strategy with pseudorandom error correction (PRC) codes, using a fixed watermark sequence with randomly selected starting indices for each video. For watermark extraction, we propose a Temporal Matching Module (TMM) that leverages edit distance to align decoded messages with the original watermark sequence, ensuring resilience against temporal attacks. Experimental results show that VideoMark achieves higher decoding accuracy than existing methods while maintaining video quality comparable to watermark-free generation. The watermark remains imperceptible to attackers without the secret key, offering superior invisibility compared to other frameworks. VideoMark provides a practical, training-free solution for content attribution in diffusion-based video generation. Code and data are available at https://github.com/KYRIE-LI11/VideoMark.
Chinese: VideoMark 是一种无失真的鲁棒视频水印框架,采用纯伪随机初始化和帧间策略配合时序匹配模块,在保持视频质量的同时实现高解码精度,有效抵御时序攻击。
English: VideoMark is a distortion-free robust watermarking framework for video diffusion models that uses pure pseudorandom initialization and a frame-wise strategy with temporal matching to ensure high decoding accuracy and video quality while resisting temporal attacks.
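
The abstract describes aligning decoded per-frame messages with a fixed watermark sequence by edit distance when the starting index is unknown and frames may have been deleted. The sketch below illustrates that matching idea only; it is not the Temporal Matching Module itself. Treating the watermark sequence as cyclic, comparing messages by exact equality, and the names `best_start_index` and `window` are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # drop x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_start_index(decoded, watermark_sequence, window=None):
    """Return (start, distance) for the window of the watermark sequence
    that is closest to the decoded messages in edit distance.

    Assumption: the sequence wraps around cyclically. `window` is the
    number of frames originally watermarked; it defaults to len(decoded).
    Ties are resolved in favor of the smallest start index.
    """
    n = len(watermark_sequence)
    if window is None:
        window = len(decoded)
    best = None
    for start in range(n):
        candidate = [watermark_sequence[(start + k) % n]
                     for k in range(window)]
        distance = edit_distance(decoded, candidate)
        if best is None or distance < best[1]:
            best = (start, distance)
    return best


# Hypothetical per-frame messages; a real system would decode bit strings.
sequence = ["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]
# Five frames were watermarked starting at index 3, then one was deleted.
decoded = ["m3", "m4", "m6", "m7"]
match = best_start_index(decoded, sequence, window=5)
```

Because edit distance tolerates insertions and deletions, the deleted frame costs a single edit at the correct start index, while every other start index requires more edits.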

Authors:Jiwan Kim, Hongseok Kang, Sein Kim, Kibum Kim, Chanyoung Park
Title: Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios
Abstract:
Multi-modal recommender systems (MRSs) have achieved notable success in improving personalization by leveraging diverse modalities such as images, text, and audio. However, two key challenges remain insufficiently addressed: (1) insufficient consideration of missing modality scenarios and (2) the overlooking of unique characteristics of modality features. These challenges result in significant performance degradation in realistic situations where modalities are missing. To address these issues, we propose Disentangling and Generating Modality Recommender (DGMRec), a novel framework tailored for missing modality scenarios. DGMRec disentangles modality features into general and specific modality features from an information-based perspective, enabling richer representations for recommendation. Building on this, it generates missing modality features by integrating aligned features from other modalities and leveraging user modality preferences. Extensive experiments show that DGMRec consistently outperforms state-of-the-art MRSs in challenging scenarios, including missing modalities and new item settings as well as diverse missing ratios and varying levels of missing modalities. Moreover, DGMRec's generation-based approach enables cross-modal retrieval, a task inapplicable for existing MRSs, highlighting its adaptability and potential for real-world applications. Our code is available at https://github.com/ptkjw1997/DGMRec.
中文: 提出的DGMRec框架通过解构模态特征并生成缺失特征,有效解决了多模态推荐系统中的关键挑战,在数据不完整场景下表现优异,并实现了现有系统无法完成的跨模态检索功能。
English: The proposed DGMRec framework addresses key challenges in multi-modal recommender systems by disentangling modality features and generating missing ones, achieving superior performance in scenarios with incomplete data and enabling novel cross-modal retrieval capabilities.

Authors:André Longon
Title: Naturally Computed Scale Invariance in the Residual Stream of ResNet18
Abstract:
An important capacity in visual object recognition is invariance to image-altering variables which leave the identity of objects unchanged, such as lighting, rotation, and scale. How do neural networks achieve this? Prior mechanistic interpretability research has illuminated some invariance-building circuitry in InceptionV1, but the results are limited and networks with different architectures have remained largely unexplored. This work investigates ResNet18 with a particular focus on its residual stream, an architectural component which InceptionV1 lacks. We observe that many convolutional channels in intermediate blocks exhibit scale invariant properties, computed by the element-wise residual summation of scale equivariant representations: the block input's smaller-scale copy with the block pre-sum output's larger-scale copy. Through subsequent ablation experiments, we attempt to causally link these neural properties with scale-robust object recognition behavior. Our tentative findings suggest how the residual stream computes scale invariance and its possible role in behavior. Code is available at: https://github.com/cest-andre/residual-stream-interp
中文: 本研究揭示了ResNet18通过残差流整合多尺度等变表征来实现尺度不变性的机制,这种机制可能支撑了模型跨尺度物体识别的稳健表现。
English: This study explores how ResNet18's residual stream achieves scale invariance by combining scale-equivariant representations, potentially enabling robust object recognition across different sizes.

Authors:Henry Marichal, Verónica Casaravilla, Candice Power, Karolain Mello, Joaquín Mazarino, Christine Lucas, Ludmila Profumo, Diego Passarella, Gregory Randall
Title: DeepCS-TRD, a Deep Learning-based Cross-Section Tree Ring Detector
Abstract:
Here, we propose Deep CS-TRD, a new automatic algorithm for detecting tree rings in whole cross-sections. It replaces the edge detection step of CS-TRD with a deep-learning-based approach (U-Net), which allows the method to be applied to different image domains (microscopy, scanner, or smartphone acquired) and species (Pinus taeda, Gleditsia triacanthos and Salix glauca). Additionally, we introduce two publicly available datasets of annotated images to the community. The proposed method outperforms state-of-the-art approaches in macro images (Pinus taeda and Gleditsia triacanthos) while showing slightly lower performance in microscopy images of Salix glauca. To our knowledge, this is the first paper that studies automatic tree ring detection for such different species and acquisition conditions. The dataset and source code are available at https://github.com/hmarichal93/deepcstrd
中文摘要:Deep CS-TRD提出了一种基于U-Net深度学习的自动树木年轮检测算法,适用于多种图像类型和树种,在多数情况下优于现有方法,并公开了数据集和源代码。
English Summary: Deep CS-TRD introduces a deep learning-based algorithm using U-Net for automatic tree ring detection across multiple image types and species, outperforming existing methods in most cases while providing open datasets and code.

Authors:Obed Korshie Dzikunu, Amirhossein Toosi, Shadab Ahamed, Sara Harsini, Francois Benard, Xiaoxiao Li, Arman Rahmim
Title: Comprehensive Evaluation of Quantitative Measurements from Automated Deep Segmentations of PSMA PET/CT Images
Abstract:
This study performs a comprehensive evaluation of quantitative measurements as extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics, namely SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training deep neural networks, U-Net, Attention U-Net and SegResNet with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1 weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
中文: 本研究评估了深度学习分割方法在前列腺癌PET/CT扫描中的应用,发现采用L1DFL损失函数的Attention U-Net模型与临床测量值具有最佳相关性,同时显著降低了定量评估的变异性。
English: This study evaluates deep-learning segmentation methods for prostate cancer PET/CT scans, finding that Attention U-Net with the proposed L1DFL loss function achieves superior correlation with clinical measurements while reducing variability in quantitative assessments.
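The abstract reports agreement with the ground truth as a concordance correlation. As a minimal sketch, assuming the standard definition of Lin's concordance correlation coefficient (the abstract does not spell out the exact estimator used), the metric can be computed as below. The function name and the sample values are illustrative only and are not taken from the paper or its repository.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for two equal-length sequences.

    Uses population (divide-by-n) variances and covariance:
        CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov_xy / denominator


# Illustrative values only (not data from the study).
reference = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]
perfect = concordance_correlation(reference, reference)
biased = concordance_correlation(reference, shifted)
```

Unlike Pearson correlation, the coefficient penalizes a constant offset between prediction and reference, which is why the shifted series scores below 1 even though it is perfectly linearly related.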

Authors:Martin Fleischmann, Anastassia Vybornova, James D. Gaboardi, Anna Brázdová, Daniela Dančejová
Title: Adaptive continuity-preserving simplification of street networks
Abstract:
Street network data is widely used to study human-based activities and urban structure. Often, these data are geared towards transportation applications, which require highly granular, directed graphs that capture the complex relationships of potential traffic patterns. While this level of network detail is critical for certain fine-grained mobility models, it represents a hindrance for studies concerned with the morphology of the street network. For the latter case, street network simplification - the process of converting a highly granular input network into its most simple morphological form - is a necessary, but highly tedious preprocessing step, especially when conducted manually. In this manuscript, we develop and present a novel adaptive algorithm for simplifying street networks that is both fully automated and able to mimic results obtained through a manual simplification routine. The algorithm - available in the neatnet Python package - outperforms current state-of-the-art procedures when comparing those methods to manually, human-simplified data, while preserving network continuity.
中文: 本手稿提出了一种新颖的自适应算法,能够自动简化街道网络,模拟人工简化结果,在保持网络连续性的同时优于现有方法。
English: This manuscript introduces a novel adaptive algorithm that automates the simplification of street networks, mimicking manual results and outperforming existing methods while maintaining network continuity.

Authors:Zexi Fan, Yan Sun, Shihao Yang, Yiping Lu
Title: Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning
Abstract:
High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases the SciML predictions during inference by enforcing the physical laws. SCaSML leverages newly derived physical laws that quantify systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the prediction. Both numerical and theoretical analyses confirm enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximated solutions to high-dimensional PDEs during inference. Code of SCaSML is available at https://github.com/Francis-Fan-create/SCaSML.
中文:提出的SCaSML框架通过强制物理定律在推理过程中动态优化和校正科学机器学习预测,相比基础模型将高维偏微分方程求解误差降低了20-50%。
English: The proposed SCaSML framework dynamically refines and debiases scientific machine learning predictions during inference by enforcing physical laws, achieving 20-50% error reduction in solving high-dimensional PDEs compared to base models.

Authors:Jingchao Wang, Hong Wang, Wenlong Zhang, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
Title: Progressive Language-guided Visual Learning for Multi-Task Visual Grounding
Abstract:
Multi-task visual grounding (MTVG) includes two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). The existing representative approaches generally follow a research pipeline that consists of three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Despite achieving remarkable performance, this research line has two limitations: 1) The linguistic content has not been fully injected into the entire visual backbone to enable more effective visual feature extraction, and it needs an extra cross-modal interaction module; 2) The relationship between REC and RES tasks is not effectively exploited to help the collaborative prediction for more accurate output. To deal with these problems, in this paper, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL does not need an additional cross-modal fusion module while fully introducing language guidance. Furthermore, our analysis shows that the localization center for REC helps identify the to-be-segmented object region for RES to some extent. Inspired by this observation, we design a multi-task head to accomplish collaborative predictions for these two sub-tasks. Extensive experiments conducted on several benchmark datasets show that PLVL clearly outperforms the representative methods in both REC and RES tasks. https://github.com/jcwang0602/PLVL
中文摘要:本文提出PLVL框架,通过渐进式语言引导视觉学习,无需额外跨模态交互模块即可实现语言与视觉的深度融合,并利用多任务头协同优化指称表达理解与分割任务,显著提升了性能。
English Summary: The paper introduces PLVL, a framework that progressively integrates language guidance into visual learning for multi-task visual grounding, eliminating the need for separate cross-modal fusion and enhancing collaborative predictions between referring expression comprehension and segmentation.

Authors:Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang
Title: MARFT: Multi-Agent Reinforcement Fine-Tuning
Abstract:
LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, reinforcement learning (RL) has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of multi-agent RL (MARL) methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new POMDP called Flex-POMDP, which aligns with the LaMAS optimization in real-world applications, and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to reinforcement fine-tuning (RFT), setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and open challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
中文: 本文提出多智能体强化微调(MARFT)新范式,通过设计Flex-POMDP框架和开源实现,解决了多智能体强化学习在基于大语言模型的多智能体系统中应用的核心挑战,为开发自适应智能体系统提供了技术路线。
English: This article introduces Multi-Agent Reinforcement Fine-Tuning (MARFT), a novel paradigm that addresses the challenges of applying multi-agent reinforcement learning to LLM-based multi-agent systems by proposing a flexible POMDP framework and providing open-source implementation to advance adaptive agentic solutions.

Authors:Xingxing Zuo, Nikhil Ranganathan, Connor Lee, Georgia Gkioxari, Soon-Jo Chung
Title: MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation
Abstract:
Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
中文摘要:本研究提出一种置信度感知蒸馏方法,通过从RGB单目深度估计模型迁移知识来提升热图像深度估计性能,在无标注热数据情况下将相对误差降低了22.88%。
English Summary: This study introduces a confidence-aware distillation method that transfers knowledge from RGB-based monocular depth estimation models to enhance thermal image depth estimation, achieving a 22.88% error reduction without requiring labeled thermal data.
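The headline number above is a relative reduction in absolute relative error. As a minimal sketch, assuming the usual definition of that depth metric (mean of |prediction - ground truth| / ground truth over valid pixels), the two quantities can be computed as follows. The function names and toy depth values are hypothetical and are not taken from the paper.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt) over pixels with gt > 0."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline_error, improved_error):
    """Fraction by which the error dropped relative to the baseline."""
    return (baseline_error - improved_error) / baseline_error


# Toy depths in metres (illustrative only).
ground_truth = [1.0, 4.0]
baseline_pred = [2.0, 4.0]   # AbsRel = (1.0 / 1.0 + 0.0) / 2 = 0.5
improved_pred = [1.5, 4.0]   # AbsRel = (0.5 / 1.0 + 0.0) / 2 = 0.25

baseline_err = abs_rel(baseline_pred, ground_truth)
improved_err = abs_rel(improved_pred, ground_truth)
reduction = relative_reduction(baseline_err, improved_err)
```

In this toy case the reduction is 0.5, i.e. a 50% drop; the 22.88% in the abstract is the same kind of ratio measured on the authors' evaluation data.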

Authors:Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, Han-Jia Ye
Title: Representation Learning for Tabular Data: A Comprehensive Survey
Abstract:
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data -- features, samples, and objectives -- and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: https://github.com/LAMDA-Tabular/Tabular-Survey.
中文摘要:表格数据是机器学习中最常用的数据类型之一,本文系统梳理了其表示学习方法,将模型分为专用、可迁移和通用三类,并探讨了相关应用与扩展方向。
English Summary: Tabular data is widely used in machine learning, and this survey systematically explores representation learning methods for it, categorizing models into specialized, transferable, and general types while discussing their applications and extensions.

Authors:Jiaxing Xu, Kai He, Yue Tang, Wei Li, Mengcheng Lan, Xia Dong, Yiping Ke, Mengling Feng
Title: BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification
Abstract:
Neurological conditions, such as Alzheimer's Disease, are challenging to diagnose, particularly in the early stages where patients' symptoms are hard to distinguish from healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model's predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model's capability to predict neurological disease stages while offering more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework's ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience. The code is available at https://github.com/AngusMonroe/BrainPrompt
中文摘要:BrainPrompt是一种创新框架,通过将大型语言模型与知识驱动提示相结合来增强图神经网络,能够有效捕捉复杂的非成像信息并提高神经系统疾病诊断的可解释性。
English Summary: BrainPrompt is a novel framework that integrates Large Language Models with knowledge-driven prompts into Graph Neural Networks to enhance neurological disease diagnosis by capturing complex non-imaging information and improving interpretability.

Authors:Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
Title: TTRL: Test-Time Reinforcement Learning
Abstract:
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, it consistently surpasses the upper limit of the initial model's maj@n and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
中文: 本文提出了一种名为测试时强化学习(TTRL)的新方法,通过利用多数投票等测试时扩展技术进行奖励估计,使大语言模型能够在无标注测试数据上借助强化学习实现自我进化。
English: This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method that enables large language models to self-improve using reinforcement learning on unlabeled test data by leveraging test-time scaling techniques like majority voting for reward estimation.
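The abstract says that majority voting over sampled answers yields rewards usable for RL without ground-truth labels. The sketch below illustrates that idea in its simplest form, assuming a binary reward that is 1 when a sampled answer matches the majority answer and 0 otherwise. The exact reward shaping in TTRL is not given in the abstract, and the function name and sample answers are hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for the answers sampled for one prompt.

    The most frequent answer serves as a pseudo-label; each sample gets
    reward 1.0 if it agrees with that pseudo-label and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards


# Five hypothetical final answers extracted from sampled model outputs.
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
```

Here `label` is `"42"` and three of the five samples receive a reward of 1.0. These per-sample rewards would then feed a policy-gradient style update, which is outside the scope of this sketch.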

Authors:Ziqi Pang, Yu-Xiong Wang
Title: MR. Video: "MapReduce" is the Principle for Long Video Understanding
Abstract:
We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents that typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence parallel perception of short video segments. Its Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), and integrating them into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
中文: MR. Video框架采用MapReduce原理处理长视频,通过独立感知短视频片段并联合整合信息,在LVBench上相比现有方法实现了超过10%的准确率提升。
English: MR. Video is an agentic framework that applies the MapReduce principle to long video understanding by independently perceiving short clips and jointly aggregating their information, achieving over 10% accuracy improvement on LVBench compared to existing methods.
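The Map and Reduce steps described above can be sketched as a toy pipeline: each clip is processed independently (map), then the per-clip results are merged in temporal order (reduce). This is only an illustration of the principle, under the assumption that clips are already represented as caption strings; the keyword filter stands in for the LLM-based perception and analysis that MR. Video actually uses, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_clip(indexed_caption, keyword):
    """Map step: inspect one clip on its own and return the relevant sentences."""
    index, caption = indexed_caption
    hits = [s.strip() for s in caption.split(".") if keyword in s.lower()]
    return index, hits


def reduce_clips(mapped):
    """Reduce step: merge per-clip findings into one list in temporal order."""
    evidence = []
    for index, hits in sorted(mapped):
        for hit in hits:
            evidence.append(f"clip {index}: {hit}")
    return evidence


def gather_evidence(captions, keyword):
    # Clips are independent, so the map step can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(lambda item: map_clip(item, keyword), enumerate(captions)))
    return reduce_clips(mapped)


captions = ["A dog runs. A man waves.", "The sky is blue.", "The dog sleeps."]
evidence = gather_evidence(captions, "dog")
```

Because no clip depends on another during the map step, the work scales with the number of workers rather than with a model's context length, which is the property the abstract emphasizes.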

Authors:Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
Title: Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Abstract:
Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of related works involved in this survey is also available on https://github.com/Eyeline-Research/Survey-Video-Diffusion.
中文: 扩散模型以卓越的时序一致性和视觉质量革新了视频生成领域,本综述系统性地梳理了其技术演进与方法体系,在剖析运动连贯性等挑战的同时,为研究者提供了涵盖评估指标与工程实践的完整资源库。
English: Diffusion models have revolutionized video generation with superior quality and temporal consistency, though challenges in motion coherence and efficiency persist, as this comprehensive survey systematically reviews their evolution, methodologies, and applications while offering an updated perspective on the field.

Authors:Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
Title: LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
Abstract:
State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths greatly exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.
中文:LongMamba是一种无需训练的技术,通过识别并筛选全局通道中的关键令牌来缓解内存衰减,从而显著提升Mamba模型的长上下文理解能力,且无需额外训练。
English: LongMamba is a training-free technique that enhances Mamba models' long-context understanding by identifying and filtering critical tokens in global channels to mitigate memory decay, significantly improving performance without additional training.

Authors:Nicholas Julian Behr, Mattia Bianchi, Keith Moffat, Saverio Bolognani, Florian Dörfler
Title: PRIME: Fast Primal-Dual Feedback Optimization for Markets with Application to Optimal Power Flow
Abstract:
Online Feedback Optimization (OFO) controllers iteratively drive a plant to an optimal operating point that satisfies input and output constraints, relying solely on the input-output sensitivity as model information. This paper introduces PRIME (PRoximal Iterative MarkEts), a novel OFO approach based on proximal-point iterations. Unlike existing OFO solutions, PRIME admits a market-based implementation, where self-interested actors are incentivized to make choices that result in safe and efficient operation, without communicating private costs or constraints. Furthermore, PRIME can handle non-smooth objective functions, achieve fast convergence rates and rapid constraint satisfaction, and effectively reject measurement noise. We demonstrate PRIME on an AC optimal power flow problem, obtaining an efficient real-time nonlinear local marginal pricing scheme.
中文: PRIME是一种新颖的在线反馈优化方法,通过基于市场的实施实现安全高效运行而无需共享私有信息,能处理非光滑目标函数并具备快速收敛和抗噪能力,已在交流最优潮流问题中得到验证。
English: PRIME is a novel Online Feedback Optimization approach that enables market-based implementation for safe and efficient operation without sharing private information, while handling non-smooth objectives with fast convergence and noise rejection, as demonstrated in an AC optimal power flow application.
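PRIME is described as building on proximal-point iterations and as handling non-smooth objectives. The sketch below shows only the generic proximal-point iteration x_{k+1} = prox_{λf}(x_k) on a one-dimensional non-smooth function f(x) = |x - c|, whose proximal operator is a shifted soft-threshold. It is not the PRIME controller, involves no plant, constraints, or market mechanism, and every name in it is an illustrative assumption.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clamping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def prox_abs(point, center, step):
    """Proximal operator of step * |x - center| evaluated at point."""
    return center + soft_threshold(point - center, step)


def proximal_point(start, center, step, iterations):
    """Run x_{k+1} = prox(x_k) and return the whole trajectory."""
    x = start
    trajectory = [x]
    for _ in range(iterations):
        x = prox_abs(x, center, step)
        trajectory.append(x)
    return trajectory


# Minimize |x - 2| starting from x = 5 with step size 1.
path = proximal_point(start=5.0, center=2.0, step=1.0, iterations=5)
```

The iterate moves one step size per iteration and then stays exactly at the minimizer, even though the objective is not differentiable there. This is the kind of behavior that makes proximal methods attractive for non-smooth costs.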

Authors:Song Wang, Xiaolu Liu, Lingdong Kong, Jianyun Xu, Chunyong Hu, Gongfan Fang, Wentong Li, Jianke Zhu, Xinchao Wang
Title: PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
Abstract:
Self-supervised representation learning for point clouds has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: https://github.com/songw-zju/PointLoRA.
中文: PointLoRA提出了一种参数高效的微调方法,结合低秩适应与多尺度令牌选择来优化点云模型,仅使用3.43%的可训练参数就能实现优异性能,同时提升全局和局部特征提取能力。
English: PointLoRA introduces a parameter-efficient fine-tuning method that integrates low-rank adaptation with multi-scale token selection to optimize point cloud models, achieving competitive performance using only 3.43% of trainable parameters while enhancing global and local feature extraction.
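To make the low-rank adaptation part concrete, the sketch below shows the generic LoRA update y = W x + s · B(A x), with the pre-trained weight W frozen and only the small factors A and B trainable, plus the resulting trainable-parameter ratio for a single layer. It covers plain LoRA only, not PointLoRA's multi-scale token selection. The layer sizes and rank are illustrative assumptions and do not reproduce the 3.43% figure reported for the full model.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(frozen_w, lora_a, lora_b, x, scale=1.0):
    """Compute W x + scale * B (A x).

    frozen_w: d_out x d_in (pre-trained, not updated)
    lora_a:   r x d_in     (trainable)
    lora_b:   d_out x r    (trainable, commonly initialized to zeros)
    """
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters as a fraction of the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


# With B initialized to zeros the adapted layer reproduces the frozen layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # rank 1
B = [[0.0], [0.0]]
output = lora_forward(W, A, B, [2.0, 3.0])

# Hypothetical 768 x 768 projection with rank 8.
fraction = lora_param_fraction(768, 768, 8)
```

For the hypothetical 768 x 768 projection, rank 8 trains 1/48 (about 2.1%) of that layer's weights, which shows why placing LoRA in the most parameter-heavy components keeps the tunable budget small.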

Authors:Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Title: FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Abstract:
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
中文: FreeGraftor提出了一种无需训练的框架,通过跨图像特征嫁接和创新的噪声初始化策略,在无需微调的情况下实现了卓越的主体保真度和文本对齐效果,有效解决了主体驱动图像生成中保真度与效率的权衡问题。
English: FreeGraftor introduces a training-free framework that overcomes the fidelity-efficiency trade-off in subject-driven image generation by employing cross-image feature grafting and a novel noise initialization strategy, achieving superior subject fidelity and text alignment without requiring fine-tuning.

Authors:Ekaterina Kondrateva, Sandzhi Barg, Mikhail Vasiliev
Title: Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time
Abstract:
Accurate and reproducible brain morphometry from structural MRI is critical for monitoring neuroanatomical changes across time and across imaging domains. Although deep learning has accelerated segmentation workflows, scanner-induced variability and reproducibility limitations remain, especially in longitudinal and multi-site settings. In this study, we benchmark two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, one of the most widely adopted tools in neuroimaging. Using two complementary datasets, a 17-year longitudinal cohort (SIMON) and a 9-site test-retest cohort (SRPBS), we quantify inter-scan segmentation variability using the Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and Mean Absolute Percentage Error (MAPE). Our results reveal up to 7-8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon, even under controlled test-retest conditions. This raises a key question: is it feasible to detect subtle longitudinal changes on the order of 5-10% in pea-sized brain regions, given the magnitude of domain-induced morphometric noise? We further analyze the effects of registration templates and interpolation modes, and propose surface-based quality filtering to improve segmentation reliability. This study provides a reproducible benchmark for morphometric reproducibility and emphasizes the need for harmonization strategies in real-world neuroimaging studies. Code and figures: https://github.com/kondratevakate/brain-mri-segmentation
中文: 本研究对FastSurfer和SynthSeg流程进行基准测试,发现小型脑区存在高达7-8%的体积变异,强调在神经影像研究中需采用协调策略来提升分割可靠性。
English: This study benchmarks FastSurfer and SynthSeg pipelines, revealing up to 7-8% volume variation in small brain structures and emphasizing the need for harmonization strategies to improve segmentation reliability in neuroimaging.
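
Two of the metrics named in the abstract, the Dice coefficient and MAPE, have standard textbook definitions that can be sketched in a few lines. This is a minimal illustration on toy data, not the authors' evaluation code: the masks are flattened binary lists, the volumes are invented numbers, and the function names are hypothetical.

```python
from typing import Sequence


def dice(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice overlap 2|A∩B| / (|A| + |B|) between two flattened binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def mape(reference: Sequence[float], measured: Sequence[float]) -> float:
    """Mean absolute percentage error of measured volumes against reference volumes."""
    errors = [abs(m - r) / abs(r) for r, m in zip(reference, measured)]
    return 100.0 * sum(errors) / len(errors)


# Toy test-retest pair: the same structure segmented in two scans.
scan_1 = [1, 1, 1, 0, 0]
scan_2 = [0, 1, 1, 1, 0]
print(round(dice(scan_1, scan_2), 3))

# Invented structure volumes (mm^3) from a first and a repeat scan.
print(round(mape([1000.0, 2000.0], [1070.0, 1900.0]), 2))
```

In this toy case the two volumes deviate by 7% and 5%, so the MAPE sits in the same range as the longitudinal changes the abstract asks about, which is the core of the detectability concern.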

Authors:Alycia Carey, Xintao Wu
Title: Achieving Distributive Justice in Federated Learning via Uncertainty Quantification
Abstract:
Client-level fairness metrics for federated learning are used to ensure that all clients in a federation either: a) have similar final performance on their local data distributions (i.e., client parity), or b) obtain final performance on their local data distributions relative to their contribution to the federated learning process (i.e., contribution fairness). While a handful of works that propose either client-parity or contribution-based fairness metrics ground their definitions and decisions in social theories of equality (such as distributive justice), most works arbitrarily choose which notion of fairness to align with, which makes it difficult for practitioners to choose the fairness metric that best matches their fairness ethics. In this work, we propose UDJ-FL (Uncertainty-based Distributive Justice for Federated Learning), a flexible federated learning framework that can achieve multiple distributive justice-based client-level fairness metrics. Namely, by utilizing techniques inspired by fair resource allocation, in conjunction with performing aleatoric uncertainty-based client weighing, our UDJ-FL framework is able to achieve egalitarian, utilitarian, Rawls' difference principle, or desert-based client-level fairness. We empirically show the ability of UDJ-FL to achieve all four defined distributive justice-based client-level fairness metrics in addition to providing fairness equivalent to (or surpassing) other popular fair federated learning works. Further, we provide justification for why aleatoric uncertainty weighing is necessary to the construction of our UDJ-FL framework as well as derive theoretical guarantees for the generalization bounds of UDJ-FL. Our code is publicly available at https://github.com/alycia-noel/UDJ-FL.
中文: 本研究提出的UDJ-FL框架通过基于偶然不确定性的客户端加权和公平资源分配技术,能够实现基于分配正义的多种客户端级公平指标,包括平等主义、功利主义、罗尔斯差异原则和应得公平,其公平性表现达到或超越了现有联邦学习方法。
English: The proposed UDJ-FL framework enables federated learning systems to achieve multiple client-level fairness metrics based on distributive justice principles, including egalitarian, utilitarian, Rawlsian, and desert-based fairness, while demonstrating empirical performance that matches or exceeds existing approaches.

Authors:Chang Zong, Bin Li, Shoujun Zhou, Jian Wan, Lei Zhang
Title: Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
Abstract:
Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
中文: 本文提出In-VAL任务,模拟人与视频的交互过程,通过Ask2Loc框架解决语义鸿沟问题来定位教学片段,相比传统方法性能提升高达14.91 mIoU。
English: The paper introduces In-VAL, a task simulating human-video interactions to locate instructional segments by resolving semantic gaps through the Ask2Loc framework, which enhances performance by up to 14.91 mIoU over traditional methods.
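
The reported gain is measured in mIoU. For segment localization this is commonly computed as the mean temporal intersection-over-union between predicted and ground-truth time spans; the sketch below assumes that definition. The timestamps and function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Hypothetical predictions for two questions about one video.
predictions = [(10.0, 30.0), (0.0, 10.0)]
ground_truth = [(20.0, 40.0), (0.0, 10.0)]
print(round(mean_iou(predictions, ground_truth), 3))
```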

Authors:Lotfi Abdelkrim Mecharbat, Ibrahim Almakky, Martin Takac, Mohammad Yaqub
Title: MedNNS: Supernet-based Medical Task-Adaptive Neural Network Search
Abstract:
Deep learning (DL) has achieved remarkable progress in the field of medical imaging. However, adapting DL models to medical tasks remains a significant challenge, primarily due to two key factors: (1) architecture selection, as different tasks necessitate specialized model designs, and (2) weight initialization, which directly impacts the convergence speed and final performance of the models. Although transfer learning from ImageNet is a widely adopted strategy, its effectiveness is constrained by the substantial differences between natural and medical images. To address these challenges, we introduce Medical Neural Network Search (MedNNS), the first Neural Network Search framework for medical imaging applications. MedNNS jointly optimizes architecture selection and weight initialization by constructing a meta-space that encodes datasets and models based on how well they perform together. We build this space using a Supernetwork-based approach, expanding the model zoo size by 51x over previous state-of-the-art (SOTA) methods. Moreover, we introduce rank loss and Fréchet Inception Distance (FID) loss into the construction of the space to capture inter-model and inter-dataset relationships, thereby achieving more accurate alignment in the meta-space. Experimental results across multiple datasets demonstrate that MedNNS significantly outperforms both ImageNet pre-trained DL models and SOTA Neural Architecture Search (NAS) methods, achieving an average accuracy improvement of 1.7% across datasets while converging substantially faster. The code and the processed meta-space are available at https://github.com/BioMedIA-MBZUAI/MedNNS.
中文: 深度学习在医学影像领域取得显著进展,但模型适应仍具挑战;MedNNS通过构建元空间联合优化架构与权重初始化,以更高精度和更快收敛速度超越现有方法。
English: Deep learning has advanced medical imaging but faces challenges in model adaptation, which MedNNS addresses by jointly optimizing architecture and weight initialization through a meta-space, outperforming existing methods with higher accuracy and faster convergence.

Authors:Diego de Oliveira Hitzges, Suman Ghosh, Guillermo Gallego
Title: DERD-Net: Learning Depth from Event-based Ray Densities
Abstract:
Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net
中文: 本文提出了一种基于事件相机的新型深度学习框架,通过3D卷积和循环结构处理局部视差空间图像,在单目和立体设置中均实现最先进性能,同时保持恒定计算复杂度。
English: This paper introduces a novel deep learning framework for event-based depth estimation that processes local disparity space images with 3D convolutions and recurrent structures, achieving state-of-the-art performance in both monocular and stereo setups while maintaining constant computational complexity.

Authors:Lingxi Cui, Huan Li, Ke Chen, Lidan Shou, Gang Chen
Title: NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery
Abstract:
With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at https://github.com/SuDIS-ZJU/nlcTables to foster future research.
中文: 本文提出了结合查询表格与自然语言要求来优化表格搜索的NL条件表格发现新任务,并发布了揭示现有方法存在显著性能差距的综合基准数据集。
English: This paper introduces a new task called NL-conditional table discovery (nlcTD) that combines query tables with natural language requirements to improve table search, and presents a comprehensive benchmark dataset revealing significant performance gaps in current methods.

Authors:Luwei Xiao, Rui Mao, Shuai Zhao, Qika Lin, Yanhao Jia, Liang He, Erik Cambria
Title: Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis
Abstract:
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera
中文: 该研究提出了Chimera框架,通过整合细粒度视觉特征和大语言模型生成的认知情感解释,提升了多模态方面级情感分类的性能,实验证明了其在标准MASC数据集上的有效性,且相比GPT-4o等大语言模型具有更高的灵活性。
English: The study introduces Chimera, a framework that enhances multimodal aspect-based sentiment classification by integrating fine-grained visual features with cognitive-affective interpretations generated by a large language model, demonstrating effectiveness on standard MASC datasets and greater flexibility than LLMs such as GPT-4o.

Authors:Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang
Title: WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
Abstract:
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable code to regulate LLM agents' policies. We further propose an RL-free, model-based agent "WALL-E 2.0" through the model-predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. Together, they considerably improve learning efficiency in a new environment. On open-world challenges in Mars (Minecraft-like) and ALFWorld (embodied indoor environments), WALL-E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%-51.6% in success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.
中文: 本研究提出“世界对齐”方法,弥合大型语言模型先验知识与环境动态间的差距,使无强化学习智能体WALL-E 2.0能借助神经符号世界模型实现高效规划,在开放世界挑战中显著超越现有方法。
English: The study introduces "world alignment" to bridge the gap between LLMs' prior knowledge and environment dynamics, enabling the RL-free agent WALL-E 2.0 to leverage a neurosymbolic world model for efficient planning and significantly outperform existing methods in open-world challenges.

Authors:Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang
Title: TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Abstract:
Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reliable benchmarks and systematic methodologies. A critical challenge is the inherent hallucination in LLMs, which leads to synthetic GPS datasets that are often noisy, unverified, and self-contradictory. To address this, we introduce TrustGeoGen, a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our GeoExplore series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking. Using this engine, we create the GeoTrust-200K dataset and the corresponding GeoTrust-test benchmark, both with guaranteed cross-modal integrity. Experiments reveal that state-of-the-art models achieve only 45.83% accuracy on GeoTrust-test, highlighting its significant challenge. Furthermore, training on our synthesized data substantially improves model performance on GPS tasks, with strong generalization to out-of-domain (OOD) benchmarks. Our code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen
中文: TrustGeoGen 是一个生成经过形式化验证的几何问题的数据引擎,通过创建GeoTrust-200K数据集和GeoTrust-test基准,有效应对大语言模型的幻觉问题,显著提升了模型在几何问题解决中的表现和泛化能力。
English: TrustGeoGen is a data engine that creates formally verified geometric problems to address LLM hallucinations, producing the GeoTrust-200K dataset and GeoTrust-test benchmark, which significantly challenge existing models and enhance their performance and generalization in geometric problem solving.

Authors:Lei Xu, Mehmet Yamac, Mete Ahishali, Moncef Gabbouj
Title: Multi-Scale Tensorial Summation and Dimensional Reduction Guided Neural Network for Edge Detection
Abstract:
Edge detection has attracted considerable attention thanks to its exceptional ability to enhance performance in downstream computer vision tasks. In recent years, various deep learning methods have been explored for edge detection tasks resulting in a significant performance improvement compared to conventional computer vision algorithms. In neural networks, edge detection tasks require considerably large receptive fields to provide satisfactory performance. In a typical convolutional operation, such a large receptive field can be achieved by utilizing a significant number of consecutive layers, which yields deep network structures. Recently, a Multi-scale Tensorial Summation (MTS) factorization operator was presented, which can achieve very large receptive fields even from the initial layers. In this paper, we propose a novel MTS Dimensional Reduction (MTS-DR) module guided neural network, MTS-DR-Net, for the edge detection task. The MTS-DR-Net uses MTS layers, and corresponding MTS-DR blocks as a new backbone to remove redundant information initially. Such a dimensional reduction module enables the neural network to focus specifically on relevant information (i.e., necessary subspaces). Finally, a weight U-shaped refinement module follows MTS-DR blocks in the MTS-DR-Net. We conducted extensive experiments on two benchmark edge detection datasets: BSDS500 and BIPEDv2 to verify the effectiveness of our model. The implementation of the proposed MTS-DR-Net can be found at https://github.com/LeiXuAI/MTS-DR-Net.git.
中文: 本文提出了一种新型的MTS-DR-Net神经网络,它采用多尺度张量求和降维模块来高效获取大感受野并消除冗余信息,在BSDS500和BIPEDv2数据集上的实验验证了其显著提升边缘检测性能的有效性。
English: This paper introduces MTS-DR-Net, a novel neural network that employs Multi-scale Tensorial Summation Dimensional Reduction modules to achieve large receptive fields efficiently and remove redundant information, significantly enhancing edge detection performance as validated on BSDS500 and BIPEDv2 datasets.

Authors:Manjunath D, Aniruddh Sikdar, Prajwal Gurunath, Sumanth Udupa, Suresh Sundaram
Title: SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems
Abstract:
Domain-adaptive thermal object detection plays a key role in facilitating visible (RGB)-to-thermal (IR) adaptation by reducing the need for co-registered image pairs and minimizing reliance on large annotated IR datasets. However, inherent limitations of IR images, such as the lack of color and texture cues, pose challenges for RGB-trained models, leading to increased false positives and poor-quality pseudo-labels. To address this, we propose Semantic-Aware Gray color Augmentation (SAGA), a novel strategy for mitigating color bias and bridging the domain gap by extracting object-level features relevant to IR images. Additionally, to validate the proposed SAGA for drone imagery, we introduce IndraEye, a multi-sensor (RGB-IR) dataset designed for diverse applications. The dataset contains 5,612 images with 145,666 instances, captured from diverse angles, altitudes, backgrounds, and times of day, offering valuable opportunities for multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor-specific strengths and weaknesses. IndraEye aims to enhance the development of more robust and accurate aerial perception systems, especially in challenging environments. Experimental results show that SAGA significantly improves RGB-to-IR adaptation on autonomous driving datasets and on the IndraEye dataset, achieving consistent performance gains of +0.4% to +7.6% (mAP) when integrated with state-of-the-art domain adaptation techniques. The dataset and code are available at https://github.com/airliisc/IndraEye.
中文: 本文提出SAGA方法,通过语义感知的灰度色彩增强减少颜色偏差,有效弥合RGB到红外的域间差异,并在新型IndraEye数据集上验证了其对无人机图像域自适应目标检测的显著提升效果。
English: The paper introduces SAGA, a semantic-aware gray color augmentation method that effectively bridges the RGB-to-IR domain gap by reducing color bias and improving object detection, validated on the new IndraEye dataset which enhances multimodal aerial perception with significant performance gains.

Authors:Yannic Neuhaus, Matthias Hein
Title: RePOPE: Impact of Annotation Errors on the POPE Benchmark
Abstract:
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE.
中文: 本研究评估了MSCOCO数据集中的标签错误对POPE物体幻觉基准的影响,通过重新标注数据创建RePOPE后发现,模型性能排名发生显著变化,凸显了标签质量的重要性。
English: This study evaluates how label errors in the MSCOCO dataset affect the POPE object hallucination benchmark, revealing that re-annotating the data (RePOPE) significantly alters model performance rankings and underscores the importance of label quality.

Authors:Anjiang Wei, Huanmi Tan, Tarun Suresh, Daniel Mendoza, Thiago S. F. X. Teixeira, Ke Wang, Caroline Trippel, Alex Aiken
Title: VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation
Abstract:
Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of 125,777 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4%, respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/VeriCoder
中文: VERICODER是一个基于功能验证数据集微调的RTL代码生成模型,通过结合单元测试生成和反馈导向优化的新方法,在功能正确性方面达到了最先进的性能。
English: VERICODER is a model fine-tuned on a functionally validated RTL code generation dataset, achieving state-of-the-art performance in functional correctness through a novel methodology combining unit test generation and feedback-directed refinement.
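
The dataset construction loop described in the abstract (generate tests, simulate, revise on failure) can be sketched as a small control loop. Everything below is an assumption made for illustration: the `refine` function, its callable parameters, and the toy stand-ins for the teacher model and simulator are hypothetical and are not the authors' pipeline or any real EDA tool API.

```python
from typing import Callable, Optional, Tuple

SimResult = Tuple[bool, str]  # (all tests passed, simulation log)


def refine(
    spec: str,
    design: str,
    gen_tests: Callable[[str, str], str],
    simulate: Callable[[str, str], SimResult],
    revise: Callable[[str, str, str, str], Tuple[str, str]],
    max_rounds: int = 5,
) -> Optional[Tuple[str, str]]:
    """Return a (design, tests) pair that passes simulation, or None if none is found."""
    tests = gen_tests(spec, design)
    for round_idx in range(max_rounds + 1):
        passed, log = simulate(design, tests)
        if passed:
            return design, tests
        if round_idx < max_rounds:
            # The reviser may change the design, the tests, or both.
            design, tests = revise(spec, design, tests, log)
    return None


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_gen_tests(spec: str, design: str) -> str:
    return "check out == a & b"


def toy_simulate(design: str, tests: str) -> SimResult:
    passed = "&" in design
    log = "ok" if passed else "mismatch: expected a & b"
    return passed, log


def toy_revise(spec: str, design: str, tests: str, log: str) -> Tuple[str, str]:
    return design.replace("|", "&"), tests


result = refine("2-input AND gate", "assign out = a | b;",
                toy_gen_tests, toy_simulate, toy_revise)
print(result)
```

Returning `None` when the round budget is exhausted mirrors the idea that only examples with passing tests are kept; a real pipeline would decide separately what to do with discarded specifications.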

Authors:Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li
Title: SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking
Abstract:
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
中文: 为解决水下声学目标跟踪缺乏统一基准的问题,本研究提出了首个大规模基准SonarT165和高效框架STFTrack,该框架通过多视图模板融合和最优轨迹校正等创新模块,实现了最先进的性能。
English: To address the lack of a unified benchmark for underwater acoustic object tracking, this study introduces SonarT165, the first large-scale benchmark, and proposes STFTrack, an efficient framework with novel modules that achieves state-of-the-art performance by integrating multi-view features and correcting trajectories.

Authors:Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
Title: Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Abstract:
The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm--Web as Instruction and Web as Response--where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.
Chinese: WebR 是一种自动化框架,通过“网页作为指令”和“网页作为响应”的双重视角范式,直接从原始网页文档中合成高质量的指令微调数据,在四项基准测试中性能最高提升16.65%,并展现出卓越的兼容性和可扩展性。
English: WebR is an automated framework that synthesizes high-quality instruction-response pairs from raw web documents through a dual-perspective paradigm, significantly outperforming existing methods by up to 16.65% across benchmarks while demonstrating superior compatibility and scalability.

Authors:Zheyuan Gu, Chang Liu, Xiyuan Zhang, Chen Yang, Gaopeng Gou, Gang Xiong, Zhen Li, Sijia Li
Title: DecETT: Accurate App Fingerprinting Under Encrypted Tunnels via Dual Decouple-based Semantic Enhancement
Abstract:
Due to the growing demand for privacy protection, encrypted tunnels have become increasingly popular among mobile app users, which brings new challenges to app fingerprinting (AF)-based network management. Existing methods primarily transfer traditional AF methods to encrypted tunnels directly, ignoring the core obfuscation and re-encapsulation mechanism of encrypted tunnels, thus resulting in unsatisfactory performance. In this paper, we propose DecETT, a dual decouple-based semantic enhancement method for accurate AF under encrypted tunnels. Specifically, DecETT improves AF under encrypted tunnels from two perspectives: app-specific feature enhancement and irrelevant tunnel feature decoupling. Considering the obfuscated app-specific information in encrypted tunnel traffic, DecETT introduces TLS traffic with stronger app-specific information as a semantic anchor to guide and enhance the fingerprint generation for tunnel traffic. Furthermore, to address the app-irrelevant tunnel feature introduced by the re-encapsulation mechanism, DecETT is designed with a dual decouple-based fingerprint enhancement module, which decouples the tunnel feature and app semantic feature from tunnel traffic separately, thereby minimizing the impact of tunnel features on accurate app fingerprint extraction. Evaluation under five prevalent encrypted tunnels indicates that DecETT outperforms state-of-the-art methods in accurate AF under encrypted tunnels, and further demonstrates its superiority under tunnels with more complicated obfuscation. Project page: https://github.com/DecETT/DecETT
Chinese: 本文提出DecETT方法,通过增强应用特定特征和解耦无关隧道特征的双重解耦机制,有效提升了加密隧道环境下的应用指纹识别性能,显著优于现有技术。
English: This paper introduces DecETT, a dual decouple-based semantic enhancement method that improves app fingerprinting under encrypted tunnels by enhancing app-specific features and decoupling irrelevant tunnel features, achieving superior performance over existing approaches.

Authors:Zizhi Chen, Xinyu Zhang, Minghao Han, Yizhou Liu, Ziyun Qian, Weifeng Zhang, Xukun Zhang, Jingwei Wei, Lihua Zhang
Title: VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining
Abstract:
In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at: https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR
中文摘要:本研究首次在虚拟染色任务中引入病理视觉语言大模型,通过对比学习提示和概念锚点整合,显著提升图像真实感及下游病理检测的准确性。
English Summary: This study introduces a pathological vision-language model to enhance virtual staining by integrating contrastive prompts and concept anchors, improving realism and accuracy in medical imaging tasks.

Authors:Yixuan Zhu, Haolin Wang, Ao Li, Wenliang Zhao, Yansong Tang, Jingxuan Niu, Lei Chen, Jie Zhou, Jiwen Lu
Title: InstaRevive: One-Step Image Enhancement via Dynamic Score Matching
Abstract:
Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at https://github.com/EternalEvan/InstaRevive.
中文:InstaRevive是一种高效的图像增强框架,通过基于分数的扩散蒸馏技术结合动态控制和文本提示,在减少采样步骤的同时仍能生成高质量的视觉效果。
English: InstaRevive is an efficient image enhancement framework that uses score-based diffusion distillation with dynamic control and text prompts to reduce sampling steps while maintaining high-quality generative results.

Authors:Eammon A. Littler, Emmanuel A. Mannoh, Ethan P. M. LaRochelle
Title: Fluorescence Reference Target Quantitative Analysis Library
Abstract:
Standardized performance evaluation of fluorescence imaging systems remains a critical unmet need in the field of fluorescence-guided surgery (FGS). While the American Association of Physicists in Medicine (AAPM) TG311 report and recent FDA draft guidance provide recommended metrics for system characterization, practical tools for extracting these metrics remain limited, inconsistent, and often inaccessible. We present QUEL-QAL, an open-source Python library designed to streamline and standardize the quantitative analysis of fluorescence images using solid reference targets. The library provides a modular, reproducible workflow that includes region of interest (ROI) detection, statistical analysis, and visualization capabilities. QUEL-QAL supports key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in alignment with regulatory and academic guidance. Built on widely adopted Python packages, the library is designed to be extensible, enabling users to adapt it to novel target designs and analysis protocols. By promoting transparency, reproducibility, and regulatory alignment, QUEL-QAL offers a foundational tool to support standardized benchmarking and accelerate the development and evaluation of fluorescence imaging systems.
Chinese: QUEL-QAL 是一个开源 Python 库,通过提供模块化工作流程来标准化荧光成像系统的定量分析,解决了荧光引导手术领域对统一评估工具的迫切需求。
English: QUEL-QAL is an open-source Python library that standardizes the quantitative analysis of fluorescence imaging systems by providing modular workflows for key performance metrics, addressing the unmet need for consistent evaluation tools in fluorescence-guided surgery.

Authors:Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Title: CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Abstract:
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe
中文摘要:CAPTURe任务通过评估视觉语言模型对遮挡物后方图案化物体的计数能力,发现即使先进模型也难以进行空间推理,而人类表现近乎完美。
English Summary: The CAPTURe task evaluates vision-language models' ability to count patterned objects behind occlusions, revealing that even advanced models struggle with spatial reasoning about hidden objects while humans perform nearly flawlessly.

Authors:Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
Title: Learning Adaptive Parallel Reasoning with Language Models
Abstract:
Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
中文:自适应并行推理(APR)框架通过强化学习让语言模型动态协调串行与并行计算,相比现有方法在性能、可扩展性和准确性方面均实现显著提升。
English: The proposed Adaptive Parallel Reasoning (APR) framework enables language models to dynamically orchestrate serial and parallel computations through reinforcement learning, achieving superior performance, scalability, and accuracy compared to existing methods.
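
The spawn() and join() operations named in the abstract can be illustrated with ordinary threads. The sketch below is only an analogy under stated assumptions: `explore` and `parent` are hypothetical names, the children are plain Python searches over a Countdown-style puzzle rather than language-model inference threads, and nothing here is taken from the APR implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def explore(start: int, rest: List[int], target: int) -> Optional[str]:
    """Hypothetical child thread: left-to-right search beginning at `start`."""

    def dfs(total: int, expr: str, remaining: List[int]) -> Optional[str]:
        if total == target:
            return expr
        for i, n in enumerate(remaining):
            nxt = remaining[:i] + remaining[i + 1:]
            for op, val in (("+", total + n), ("-", total - n), ("*", total * n)):
                found = dfs(val, f"({expr} {op} {n})", nxt)
                if found is not None:
                    return found
        return None

    return dfs(start, str(start), rest)


def parent(numbers: List[int], target: int) -> Optional[str]:
    """Hypothetical parent thread: spawn one child per starting number, then join."""
    with ThreadPoolExecutor() as pool:
        # spawn: each child explores a disjoint part of the search space
        futures = [
            pool.submit(explore, n, numbers[:i] + numbers[i + 1:], target)
            for i, n in enumerate(numbers)
        ]
        # join: the parent only sees each child's short result, not its full trace
        results = [f.result() for f in futures]
    return next((r for r in results if r is not None), None)


if __name__ == "__main__":
    print(parent([3, 5, 2], 16))
```

The point of the pattern is that the parent's context holds only the joined results, while the exploration cost is spread across children. In APR the decision of when to spawn is learned, whereas here it is hard-coded.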

Authors:David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
Title: IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Abstract:
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
中文: 现有的多模态大模型评估主要关注图像推理或通用视频理解任务,忽视了图像上下文在视频理解中的重要作用,因此提出了首个全面的图像基础视频感知与推理基准IV-Bench,发现当前先进模型在此任务上表现严重不足,最高准确率仅达28.9%,并揭示了影响性能的关键因素。
English: Current multimodal models are evaluated primarily on image reasoning or general video tasks, neglecting the role of image context in video understanding, so IV-Bench is introduced as the first comprehensive benchmark for image-grounded video perception and reasoning, revealing that state-of-the-art models significantly underperform with at most 28.9% accuracy and highlighting key influencing factors.

Authors:Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir
Title: Context Aware Grounded Teacher for Source Free Object Detection
Abstract:
We focus on the Source Free Object Detection (SFOD) problem, where source data is unavailable during adaptation and the model must adapt to the unlabeled target domain. In medical imaging, several approaches have leveraged a semi-supervised student-teacher architecture to bridge domain discrepancy. Context imbalance in labeled training data and significant domain shifts between domains can lead to biased teacher models that produce inaccurate pseudolabels, degrading the student model's performance and causing a mode collapse. Class imbalance, particularly when one class significantly outnumbers another, leads to contextual bias. To tackle the problem of context bias and the significant performance drop of the student model in the SFOD setting, we introduce Grounded Teacher (GT) as a standard framework. In this study, we model contextual relationships using a dedicated relational context module and leverage it to mitigate inherent biases in the model. This approach enables us to apply augmentations to closely related classes, across and within domains, enhancing the performance of underrepresented classes while keeping the effect on dominant classes minimal. We further improve the quality of predictions by implementing an expert foundational branch to supervise the student model. We validate the effectiveness of our approach in mitigating context bias under the SFOD setting through experiments on three medical datasets supported by comprehensive ablation studies. All relevant resources, including preprocessed data, trained model weights, and code, are publicly available at https://github.com/Tajamul21/Grounded_Teacher.
中文摘要:Grounded Teacher框架通过建模上下文关系并实施针对性增强,解决了源自由目标检测中的上下文偏见和性能下降问题,有效提升弱势类别表现,同时对主导类别影响极小。
English Summary: The Grounded Teacher framework addresses context bias and performance degradation in Source Free Object Detection by modeling contextual relationships and applying targeted augmentations to improve underrepresented class performance while maintaining minimal impact on dominant classes.

Authors:Wei Fang, Priyadarshini Panda
Title: Event2Vec: Processing neuromorphic events directly by representations in vector space
Abstract:
Neuromorphic event cameras have overwhelming advantages in temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, event cameras output asynchronous, sparse, and irregular events, which are not compatible with mainstream computer vision and deep learning methods. Various methods have been proposed to solve this issue, but at the cost of long preprocessing procedures, loss of temporal resolution, or incompatibility with massively parallel computation. Inspired by the great success of word-to-vector models, we summarize the similarities between words and events, then propose the first event to vector (event2vec) representation. We validate event2vec on classifying the ASL-DVS dataset, showing better parameter efficiency, accuracy, and speed than previous graph/image/voxel-based representations. Beyond task performance, the most attractive advantage of event2vec is that it aligns events to the domain of natural language processing, showing the promising prospect of integrating events into large language and multimodal models. Our codes, models, and training logs are available at https://github.com/fangwei123456/event2vec.
中文摘要:event2vec方法提出了一种新颖的表征方式,使神经网络能够直接处理神经形态事件数据,不仅实现了高参数效率和吞吐量,还为与大型语言模型的集成开辟了新途径。
English Summary: The event2vec method introduces a novel representation that enables neural networks to process neuromorphic event data directly, offering high parameter efficiency and throughput while paving the way for integration with large language models.

Authors:Wei Fang, Priyadarshini Panda
Title: Event2Vec: Processing Neuromorphic Events directly by Representations in Vector Space
Abstract:
Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Existing solutions to this incompatibility often sacrifice temporal resolution, require extensive pre-processing, and do not fully leverage GPU acceleration. Inspired by word-to-vector models, we draw an analogy between words and events to introduce event2vec, a novel representation that allows neural networks to process events directly. This approach is fully compatible with the parallel processing and self-supervised learning capabilities of Transformer architectures. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks. A comprehensive ablation study further analyzes our method's features and contrasts them with existing representations. The experimental results show that event2vec is remarkably parameter-efficient, has high throughput, and can achieve high accuracy even with an extremely low number of events. Beyond its performance, the most significant contribution of event2vec is a new paradigm that enables neural networks to process event streams as if they were natural language. This paradigm shift paves the way for the native integration of event cameras with large language models and multimodal models. Code, model, and training logs are provided in https://github.com/Intelligent-Computing-Lab-Panda/event2vec.
中文摘要:event2vec方法提出了一种新颖的表征方式,使神经网络能够直接处理神经形态事件数据,不仅实现了高参数效率和吞吐量,还为与大型语言模型的集成开辟了新途径。
English Summary: The event2vec method introduces a novel representation that enables neural networks to process neuromorphic event data directly, offering high parameter efficiency and throughput while paving the way for integration with large language models.

Authors:Qifan Yan, Andrew Liu, Shiqi He, Mathias Lécuyer, Ivan Beschastnikh
Title: FedFetch: Faster Federated Learning with Adaptive Downstream Prefetching
Abstract:
Federated learning (FL) is a machine learning paradigm that facilitates massively distributed model training with end-user data on edge devices directed by a central server. However, the large number of heterogeneous clients in FL deployments leads to a communication bottleneck between the server and the clients. This bottleneck is made worse by straggling clients, any one of which will further slow down training. To tackle these challenges, researchers have proposed techniques like client sampling and update compression. These techniques work well in isolation but combine poorly in the downstream, server-to-client direction. This is because unselected clients have outdated local model states and need to synchronize these states with the server first. We introduce FedFetch, a strategy to mitigate the download time overhead caused by combining client sampling and compression techniques. FedFetch achieves this with an efficient prefetch schedule for clients to prefetch model states multiple rounds before a stated training round. We empirically show that adding FedFetch to communication efficient FL techniques reduces end-to-end training time by 1.26$\times$ and download time by 4.49$\times$ across compression techniques with heterogeneous client settings. Our implementation is available at https://github.com/DistributedML/FedFetch
中文: 联邦学习面临异构客户端导致的通信瓶颈,结合客户端采样和压缩技术会加剧此问题,但FedFetch通过预取模型状态减少了下载时间,将训练时间缩短1.26倍,下载时间降低4.49倍。
English: Federated learning faces communication bottlenecks from heterogeneous clients, which are worsened by combining client sampling and compression, but FedFetch reduces download time by prefetching model states, cutting training time by 1.26x and download time by 4.49x.

Authors:Yike Zhang, Eduardo Davalos, Jack Noble
Title: Vision6D: 3D-to-2D Interactive Visualization and Annotation Tool for 6D Pose Estimation
Abstract:
Accurate 6D pose estimation has gained more attention over the years for robotics-assisted tasks that require precise interaction with physical objects. This paper presents an interactive 3D-to-2D visualization and annotation tool to support the 6D pose estimation research community. To the best of our knowledge, the proposed work is the first tool that allows users to visualize and manipulate 3D objects interactively on a 2D real-world scene, along with a comprehensive user study. This system supports robust 6D camera pose annotation by providing both visual cues and spatial relationships to determine object position and orientation in various environments. The annotation feature in Vision6D is particularly helpful in scenarios where the transformation matrix between the camera and world objects is unknown, as it enables accurate annotation of these objects' poses using only the camera intrinsic matrix. This capability serves as a foundational step in developing and training advanced pose estimation models across various domains. We evaluate Vision6D's effectiveness by utilizing widely-used open-source pose estimation datasets Linemod and HANDAL through comparisons between the default ground-truth camera poses with manual annotations. A user study was performed to show that Vision6D generates accurate pose annotations via visual cues in an intuitive 3D user interface. This approach aims to bridge the gap between 2D scene projections and 3D scenes, offering an effective way for researchers and developers to solve 6D pose annotation related problems. The software is open-source and publicly available at https://github.com/InteractiveGL/vision6D.
中文: 本文介绍了Vision6D这一交互式3D到2D可视化标注工具,它通过让用户在真实场景中操控3D物体来实现精确的6D姿态估计,并附有用户研究和开源发布。
English: This paper introduces Vision6D, an interactive 3D-to-2D visualization and annotation tool that enables precise 6D pose estimation by allowing users to manipulate 3D objects in real-world scenes, supported by a user study and open-source availability.
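
Overlaying a 3D object on a 2D scene, as the abstract describes, rests on the standard pinhole projection: a 6D pose (rotation R, translation t) moves a model point into the camera frame, and the intrinsic matrix K maps it to pixels. The sketch below is a generic illustration of that relation. The function name `project_point` and the example intrinsics are assumptions, not part of the Vision6D API.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Vector3 = Sequence[float]


def project_point(K: Matrix3, R: Matrix3, t: Vector3, X: Vector3) -> Tuple[float, float]:
    """Project a 3D model point X to pixel coordinates using pose (R, t) and intrinsics K."""
    # Transform the point from the object frame into the camera frame: Xc = R @ X + t
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic mapping (including the skew term K[0][1])
    x_norm, y_norm = xc[0] / xc[2], xc[1] / xc[2]
    u = K[0][0] * x_norm + K[0][1] * y_norm + K[0][2]
    v = K[1][1] * y_norm + K[1][2]
    return u, v


if __name__ == "__main__":
    # Illustrative intrinsics: focal length 500 px, principal point (320, 240)
    K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
    t = [0.0, 0.0, 2.0]  # object placed 2 units in front of the camera
    print(project_point(K, R, t, [0.1, -0.2, 0.0]))
```

Annotating a pose interactively amounts to adjusting R and t until the projected model lines up with the image, which is why only K needs to be known in advance.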

Authors:Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, Fei-Yue Wang
Title: Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
Abstract:
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve reasoning performance comparable to verifiable reward-based methods in only 30% of the steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.
中文: 过程奖励模型(PRM)在强化微调中因传统求和形式信用分配易导致奖励破解,但提出的PURE方法采用最小值形式信用分配有效缓解此问题,显著提升推理性能和模型准确率。
English: Process reward models (PRMs) face reward hacking issues in reinforcement fine-tuning due to the conventional summation-form credit assignment, but the proposed PURE method with min-form credit assignment effectively mitigates this problem, achieving superior reasoning performance and model accuracy.
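
The contrast between the two credit-assignment forms can be made concrete for a single trajectory of per-step process rewards. This is a minimal sketch that follows only the abstract's wording (value as the discounted sum of future rewards versus value as the minimum of future rewards). The function names and the example rewards are illustrative assumptions and do not come from the PURE codebase.

```python
from typing import List


def sum_form_values(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical value: gamma-discounted sum of rewards from each step onward."""
    values = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        values[t] = running
    return values


def min_form_values(rewards: List[float]) -> List[float]:
    """Min-form value: minimum of the rewards from each step onward."""
    values = [0.0] * len(rewards)
    running = float("inf")
    for t in range(len(rewards) - 1, -1, -1):
        running = min(running, rewards[t])
        values[t] = running
    return values


if __name__ == "__main__":
    # Hypothetical process rewards: two good steps, one bad step, one good step
    step_rewards = [0.9, 0.8, -0.5, 0.7]
    print("sum-form:", sum_form_values(step_rewards))
    print("min-form:", min_form_values(step_rewards))
```

Under the sum form, padding a trajectory with more high-reward steps keeps raising the value of the early steps. Under the min form, the value can never exceed the worst remaining step, so it stays within the range of a single reward, which matches the abstract's point about limiting the value function range.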

Authors:Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Title: Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Abstract:
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity
中文摘要:本研究设计了一套最小算法任务来评估语言模型的创造性局限,发现多标记方法在生成多样性输出上优于单标记学习,且输入层噪声注入在平衡随机性与连贯性方面不亚于温度采样。
English Summary: This study introduces minimal algorithmic tasks to evaluate the creative limitations of language models, demonstrating that multi-token approaches outperform next-token learning in generating diverse outputs and that input-layer noise injection rivals temperature sampling for balancing randomness and coherence.
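
The two sources of randomness compared in the abstract can be contrasted on a toy model. The sketch is a loose illustration only: `toy_logits` is a hypothetical linear stand-in for a language model, and `seed_conditioned_greedy` adds Gaussian noise to the input before greedy decoding, which is one assumed reading of "injecting noise at the input layer" and not the authors' exact procedure.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_sample(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Output-layer randomness: sample a token index from the tempered distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def toy_logits(inputs: List[float], weights: List[List[float]]) -> List[float]:
    """Hypothetical linear 'model': one logit per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def seed_conditioned_greedy(
    inputs: List[float], weights: List[List[float]], noise_scale: float, rng: random.Random
) -> int:
    """Input-layer randomness: perturb the input, then decode greedily (argmax)."""
    noisy = [x + rng.gauss(0.0, noise_scale) for x in inputs]
    logits = toy_logits(noisy, weights)
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[1.0, 0.0], [0.0, 1.0]]
    inputs = [1.0, 0.9]
    print([temperature_sample(toy_logits(inputs, weights), 1.0, rng) for _ in range(10)])
    print([seed_conditioned_greedy(inputs, weights, 0.5, rng) for _ in range(10)])
```

In the first case the model's computation is deterministic and randomness enters only when a token is drawn. In the second the decoding is deterministic and all diversity comes from the perturbed input.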

Authors:Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang
Title: FlowReasoner: Reinforcing Query-Level Meta-Agents
Abstract:
This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic ability to reason about generating multi-agent systems. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training in terms of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% accuracy across three benchmarks. The code is available at https://github.com/sail-sg/FlowReasoner.
中文摘要:本文提出FlowReasoner查询级元代理,通过融合DeepSeek R1的推理能力并采用带多维奖励的强化学习,为每个用户查询自动生成个性化多智能体系统,在工程和竞赛基准测试中表现卓越,准确率较o1-mini提升10.52%。
English Summary: This paper introduces FlowReasoner, a query-level meta-agent that automates the design of personalized multi-agent systems for each user query by leveraging reasoning abilities distilled from DeepSeek R1 and enhanced through reinforcement learning with a multi-purpose reward function, achieving superior performance including a 10.52% accuracy gain over o1-mini.

Authors:Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig
Title: CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Abstract:
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces supply explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.
Chinese: CRUST-Bench是一个包含100个C语言仓库及对应安全Rust接口与测试用例的数据集,用于评估C到Rust的转译系统,结果表明现有方法在生成安全地道的Rust代码方面仍面临挑战,最优模型仅能完成15项任务。
English: CRUST-Bench is a dataset of 100 C repositories with safe Rust interfaces and test cases, designed to evaluate C-to-Rust transpilation systems, revealing that current methods struggle with generating safe, idiomatic Rust code as even the best model solved only 15 tasks.

Authors:Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty
Title: Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Abstract:
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
中文: JETTS基准测试表明,尽管LLM评判模型在重排序任务中与结果奖励模型表现相当,但在束搜索中不如过程奖励模型,且其自然语言评析目前无法有效指导生成模型改进回答。
English: The JETTS benchmark reveals that while LLM-judges are competitive with outcome reward models in reranking tasks, they underperform process reward models in beam search and their natural language critiques currently fail to effectively guide generators toward improved responses.
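
Of the three task settings, response reranking is the simplest to illustrate. The sketch below assumes a judge is exposed as a plain scoring function; `rerank_best_of_n` and `toy_judge` are hypothetical names, and the toy judge is a stand-in rather than any model evaluated in the benchmark.

```python
def rerank_best_of_n(responses, judge_score):
    """Sort candidate responses from highest to lowest judge score.

    Ties keep their original order, so the result is deterministic.
    """
    scored = [(judge_score(resp), idx, resp) for idx, resp in enumerate(responses)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [resp for _, _, resp in scored]


def toy_judge(response):
    """Hypothetical judge: rewards responses that end with the answer 4."""
    return 1.0 if response.strip().endswith("4") else 0.0
```

With candidates `["2+2=5", "2+2=4", "2+2=3"]`, the first element of the reranked list is the one the judge prefers; a real judge would replace `toy_judge` with a model call.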

Authors:Xiaoyu Han, Shunyuan Zheng, Zonglin Li, Chenyang Wang, Xin Sun, Quanling Meng
Title: Shape-Guided Clothing Warping for Virtual Try-On
Abstract:
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/SCW-VTON.
中文摘要:提出的SCW-VTON方法通过整合全局形状约束和肢体纹理重建,有效提升了虚拟试衣中服装与身体的贴合度并减少肢体区域变形,在定性和定量评估中均优于现有方法。
English Summary: The proposed SCW-VTON method enhances virtual clothing try-on by incorporating global shape constraints and limb texture reconstruction to improve clothing-body alignment and reduce limb distortions, outperforming existing techniques.

Authors:Sarah Alnegheimish, Zelin He, Matthew Reimherr, Akash Chandrayan, Abhinav Pradhan, Luca D'Angelo
Title: M$^2$AD: Multi-Sensor Multi-System Anomaly Detection through Global Scoring and Calibrated Thresholding
Abstract:
With the widespread availability of sensor data across industrial and operational systems, we frequently encounter heterogeneous time series from multiple systems. Anomaly detection is crucial for such systems to facilitate predictive maintenance. However, most existing anomaly detection methods are designed for either univariate or single-system multivariate data, making them insufficient for these complex scenarios. To address this, we introduce M$^2$AD, a framework for unsupervised anomaly detection in multivariate time series data from multiple systems. M$^2$AD employs deep models to capture expected behavior under normal conditions, using the residuals as indicators of potential anomalies. These residuals are then aggregated into a global anomaly score through a Gaussian Mixture Model and Gamma calibration. We theoretically demonstrate that this framework can effectively address heterogeneity and dependencies across sensors and systems. Empirically, M$^2$AD outperforms existing methods in extensive evaluations by 21% on average, and its effectiveness is demonstrated on a large-scale real-world case study on 130 assets in Amazon Fulfillment Centers. Our code and results are available at https://github.com/sarahmish/M2AD.
中文: 本文提出M²AD无监督异常检测框架,通过深度学习建模正常行为并利用高斯混合模型聚合残差,有效处理多系统多元时间序列的异质性问题,在实验中比现有方法平均性能提升21%。
English: This paper introduces M²AD, an unsupervised anomaly detection framework that addresses heterogeneity in multivariate time series from multiple systems by modeling normal behavior with deep models and aggregating the residuals into a global anomaly score through a Gaussian Mixture Model and Gamma calibration, outperforming existing methods by 21% on average.

Authors:Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples
Title: Histogram-based Parameter-efficient Tuning for Passive Sonar Classification
Abstract:
Parameter-efficient transfer learning (PETL) methods adapt large artificial neural networks to downstream tasks without fine-tuning the entire model. However, existing additive methods, such as adapters, sometimes struggle to capture distributional shifts in intermediate feature embeddings. We propose a novel histogram-based parameter-efficient tuning (HPT) technique that captures the statistics of the target domain and modulates the embeddings. Experimental results on three downstream passive sonar datasets (ShipsEar, DeepShip, VTUAD) demonstrate that HPT outperforms conventional adapters. Notably, HPT achieves 91.8% vs. 89.8% accuracy on VTUAD. Furthermore, HPT trains faster and yields feature representations closer to those of fully fine-tuned models. Overall, HPT balances parameter savings and performance, providing a distribution-aware alternative to existing adapters and showing a promising direction for scalable transfer learning in resource-constrained environments. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/HLAST_DeepShip_ParameterEfficient.
中文: 提出的基于直方图的参数高效调优(HPT)方法能有效捕捉目标域统计特征来调节嵌入表示,在被动声纳数据集上以更少参数实现了比传统适配器更高的准确率和训练效率。
English: The proposed Histogram-based Parameter-efficient Tuning (HPT) method effectively captures target domain statistics to modulate embeddings, outperforming conventional adapters in accuracy and training efficiency on passive sonar datasets while maintaining parameter efficiency.

Authors:Andy Wanna, Hanqiu Chen, Cong Hao
Title: ForgeBench: A Machine Learning Benchmark Suite and Auto-Generation Framework for Next-Generation HLS Tools
Abstract:
Although High-Level Synthesis (HLS) has attracted considerable interest in hardware design, it has not yet become mainstream due to two primary challenges. First, current HLS hardware design benchmarks are outdated as they do not cover modern machine learning (ML) applications, preventing the rigorous development of HLS tools on ML-focused hardware design. Second, existing HLS tools are outdated because they predominantly target individual accelerator designs and lack an architecture-oriented perspective to support common hardware module extraction and reuse, limiting their adaptability and broader applicability. Motivated by these two limitations, we propose ForgeBench, an ML-focused benchmark suite with a hardware design auto-generation framework for next-generation HLS tools. In addition to the auto-generation framework, we provide two ready-to-use benchmark suites. The first contains over 6,000 representative ML HLS designs. We envision future HLS tools being architecture-oriented, capable of automatically identifying common computational modules across designs, and supporting flexible dataflow and control. Accordingly, the second benchmark suite includes ML HLS designs with possible resource sharing manually implemented to highlight the necessity of architecture-oriented design, ensuring it is future-HLS ready. ForgeBench is open-sourced at https://github.com/hchen799/ForgeBench.
中文: 高级综合(HLS)面临两大挑战:基准测试未涵盖现代机器学习应用且工具缺乏架构导向设计,为此我们开发了开源基准套件ForgeBench,具备自动生成框架以支持下一代HLS工具发展。
English: High-Level Synthesis (HLS) faces two key challenges: outdated benchmarks that exclude modern machine learning applications and tools lacking architecture-oriented design, prompting the development of ForgeBench, an open-source benchmark suite with auto-generation capabilities to support future HLS tools.

Authors:Chengxi Han, Xiaoyu Su, Zhiqiang Wei, Meiqi Hu, Yichu Xu
Title: HSANET: A Hybrid Self-Cross Attention Network For Remote Sensing Change Detection
Abstract:
The remote sensing image change detection task is an essential method for large-scale monitoring. We propose HSANet, a network that uses hierarchical convolution to extract multi-scale features. It incorporates hybrid self-attention and cross-attention mechanisms to learn and fuse global and cross-scale information. This enables HSANet to capture global context at different scales and integrate cross-scale features, refining edge details and improving detection performance. We will also open-source our model code: https://github.com/ChengxiHAN/HSANet.
Chinese: HSANet提出了一种结合混合自注意力和交叉注意力的分层网络,用于提升遥感变化检测中的多尺度特征融合和边缘细节,代码已开源在GitHub上。
English: HSANet introduces a hierarchical network with hybrid self-attention and cross-attention to enhance multi-scale feature fusion and edge detail in remote sensing change detection, with code available on GitHub.

Authors:Juyeon Kim, Geon Lee, Taeuk Kim, Kijung Shin
Title: KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking
Abstract:
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.
中文: 本文提出KGMEL框架,通过生成、检索和重排三阶段整合知识图谱三元组来提升多模态实体链接的准确性,实验证明其优于现有方法。
English: The paper introduces KGMEL, a multimodal entity linking framework that enhances accuracy by incorporating knowledge-graph triples through generation, retrieval, and reranking stages, outperforming existing methods in experiments.
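
The retrieval stage selects candidate entities whose representations are close to the mention's. A minimal sketch of that idea follows, assuming embeddings are already computed and compared by cosine similarity; the similarity measure, the function names, and the toy vectors are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_candidates(mention_vec, entity_vecs, k=2):
    """Return the names of the k entities most similar to the mention.

    entity_vecs maps an entity name to its (hypothetical) embedding.
    """
    ranked = sorted(
        entity_vecs.items(),
        key=lambda item: cosine(mention_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

The retrieved shortlist would then be handed to a reranking step, which in KGMEL is performed by a large language model.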

Authors:Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Abstract:
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
中文:EasyEdit2是一个即插即用的框架,通过测试时干预让用户无需深厚技术知识即可轻松控制大型语言模型的行为,仅需一个示例即可精确调整模型响应。
English: EasyEdit2 is a plug-and-play framework that enables users to easily control Large Language Model behaviors through test-time interventions, requiring minimal technical knowledge and allowing precise adjustments with just a single example.

Authors:Yiqian Yang
Title: NeuGaze: Reshaping the future BCI
Abstract:
Traditional brain-computer interfaces (BCIs), reliant on costly electroencephalography or invasive implants, struggle with complex human-computer interactions due to setup complexity and limited precision. We present NeuGaze, a novel webcam-based system that leverages eye gaze, head movements, and facial expressions to enable intuitive, real-time control using only a standard 30 Hz webcam, often pre-installed in laptops. Requiring minimal calibration, NeuGaze achieves performance comparable to conventional inputs, supporting precise cursor navigation, key triggering via an efficient skill wheel, and dynamic gaming interactions, such as defeating formidable opponents in first-person games. By harnessing preserved neck-up functionalities in motor-impaired individuals, NeuGaze eliminates the need for specialized hardware, offering a low-cost, accessible alternative to BCIs. This paradigm empowers diverse applications, from assistive technology to entertainment, redefining human-computer interaction for motor-impaired users. The project is at https://github.com/NeuSpeech/NeuGaze.
中文: NeuGaze 提出了一种基于网络摄像头的系统,利用眼球注视、头部动作和面部表情实现直观的实时控制,为传统脑机接口提供了无需专用硬件的低成本替代方案。
English: NeuGaze introduces a webcam-based system using eye gaze, head movements, and facial expressions for intuitive, real-time control, offering a low-cost alternative to traditional BCIs without specialized hardware.

Authors:Louis Bradshaw, Simon Colton
Title: Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling
Abstract:
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
中文: 我们通过多阶段处理流程,包括自动抓取、评分和分割音频,创建了一个包含超过一百万MIDI文件的大型数据集,并提供了详细的技术分析和元数据信息。
English: We present a large-scale dataset of over one million MIDI files transcribed from piano audio recordings using a multi-stage pipeline that includes automated crawling, scoring, and segmentation, along with detailed analysis and metadata.

Authors:Minjin Choi, Sunkyung Lee, Seongmin Park, Jongwuk Lee
Title: Linear Item-Item Model with Neural Knowledge for Session-based Recommendation
Abstract:
Session-based recommendation (SBR) aims to predict users' subsequent actions by modeling short-term interactions within sessions. Existing neural models primarily focus on capturing complex dependencies for sequential item transitions. As an alternative solution, linear item-item models mainly identify strong co-occurrence patterns across items and support faster inference speed. Although each paradigm has been actively studied in SBR, their fundamental differences in capturing item relationships and how to bridge these distinct modeling paradigms effectively remain unexplored. In this paper, we propose a novel SBR model, namely Linear Item-Item model with Neural Knowledge (LINK), which integrates both types of knowledge into a unified linear framework. Specifically, we design two specialized components of LINK: (i) Linear knowledge-enhanced Item-item Similarity model (LIS), which refines the item similarity correlation via self-distillation, and (ii) Neural knowledge-enhanced Item-item Transition model (NIT), which seamlessly incorporates complicated neural knowledge distilled from the off-the-shelf neural model. Extensive experiments demonstrate that LINK outperforms state-of-the-art linear SBR models across six real-world datasets, achieving improvements of up to 14.78% and 11.04% in Recall@20 and MRR@20 while showing up to 813x fewer inference FLOPs. Our code is available at https://github.com/jin530/LINK.
中文: 本文提出LINK模型,通过自蒸馏技术将线性物品共现模式与复杂神经知识相融合,在显著降低计算成本的同时实现了更优的会话推荐性能。
English: The paper introduces LINK, a session-based recommendation model that combines linear item-item co-occurrence patterns, refined via self-distillation, with knowledge distilled from an off-the-shelf neural model in a unified linear framework, outperforming state-of-the-art linear models while requiring far fewer inference FLOPs.
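
The reported gains are in Recall@20 and MRR@20. The sketch below shows how these two metrics are commonly computed when each session has a single ground-truth next item; that single-target setup and the function names are assumptions, since the abstract does not spell out the evaluation protocol.

```python
def recall_at_k(ranked_items, target, k=20):
    """1.0 if the target appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def mrr_at_k(ranked_items, target, k=20):
    """Reciprocal rank of the target within the top-k, or 0.0 if absent."""
    top_k = ranked_items[:k]
    if target in top_k:
        return 1.0 / (top_k.index(target) + 1)
    return 0.0


def evaluate(sessions, k=20):
    """Average both metrics over (ranked_items, target) pairs."""
    count = len(sessions)
    recall = sum(recall_at_k(ranked, target, k) for ranked, target in sessions) / count
    mrr = sum(mrr_at_k(ranked, target, k) for ranked, target in sessions) / count
    return recall, mrr
```

Recall@K only checks whether the target is in the list, whereas MRR@K also rewards placing it near the top, which is why the two numbers are reported side by side.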

Authors:Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Title: RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
Chinese: RainbowPlus是一种基于进化计算的新型红队框架,通过自适应质量多样性搜索生成多样化的对抗性提示,在多个大语言模型和数据集上显著超越了现有方法的攻击成功率与效率。
English: RainbowPlus is an innovative red-teaming framework using evolutionary computation to generate diverse and effective adversarial prompts, significantly outperforming existing methods in attack success rate and efficiency across multiple LLMs and datasets.

Authors:Shilin Zhang, Zican Hu, Wenhao Wu, Xinyi Xie, Jianxiang Tang, Chunlin Chen, Daoyi Dong, Yu Cheng, Zhenhong Sun, Zhi Wang
Title: Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision
Abstract:
Offline meta-RL usually tackles generalization by inferring task beliefs from high-quality samples or warmup explorations. The restricted form limits their generality and usability since these supervision signals are expensive and even infeasible to acquire in advance for unseen tasks. Learning directly from the raw text about decision tasks is a promising alternative to leverage a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises offline meta-RL with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings to comprehend the environment dynamics. After training the text-conditioned generalist policy, the agent can directly realize zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines. Our code is available at https://github.com/NJU-RL/T2DA.
中文摘要:本文提出T2DA框架,通过对比学习将自然语言与决策嵌入对齐,用文本监督离线元强化学习,实现零样本的文本到决策生成。
English Summary: The paper introduces T2DA, a framework that uses natural language to supervise offline meta-reinforcement learning, enabling zero-shot text-to-decision generation by aligning text and decision embeddings through contrastive pre-training.
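
The abstract describes a CLIP-inspired objective that predicts which text goes with which decision embedding. A generic symmetric contrastive loss of that kind can be sketched as follows; this is a pure-Python illustration under the assumption that row i of each batch forms a matched pair, with a hypothetical default temperature, and it is not the T2DA implementation.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_loss(text_emb, decision_emb, temperature=0.07):
    """Symmetric cross-entropy over the text/decision similarity matrix."""
    texts = [_normalize(v) for v in text_emb]
    decisions = [_normalize(v) for v in decision_emb]
    n = len(texts)
    # logits[i][j]: scaled cosine similarity of text i and decision j
    logits = [
        [sum(a * b for a, b in zip(t, d)) / temperature for d in decisions]
        for t in texts
    ]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_decision = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    decision_to_text = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_decision + decision_to_text)
```

The loss is small when each text embedding is most similar to its own decision embedding and large when the pairs are mismatched, which is what pulls the two embedding spaces into alignment.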

Authors:Shiben Liu, Huijie Fan, Qiang Wang, Baojie Fan, Yandong Tang, Liangqiong Qu
Title: Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification
Abstract:
Lifelong Person Re-identification (LReID) suffers from a key challenge in preserving old knowledge while adapting to new information. The existing solutions include rehearsal-based and rehearsal-free methods to address this challenge. Rehearsal-based approaches rely on knowledge distillation, continuously accumulating forgetting during the distillation process. Rehearsal-free methods insufficiently learn the distribution of each domain, leading to forgetfulness over time. To solve these issues, we propose a novel Distribution-aware Forgetting Compensation (DAFC) model that explores cross-domain shared representation learning and domain-specific distribution integration without using old exemplars or knowledge distillation. We propose a Text-driven Prompt Aggregation (TPA) that utilizes text features to enrich prompt elements and guide the prompt model to learn fine-grained representations for each instance. This can enhance the differentiation of identity information and establish the foundation for domain distribution awareness. Then, Distribution-based Awareness and Integration (DAI) is designed to capture each domain-specific distribution by a dedicated expert network and adaptively consolidate them into a shared region in high-dimensional space. In this manner, DAI can consolidate and enhance cross-domain shared representation learning while alleviating catastrophic forgetting. Furthermore, we develop a Knowledge Consolidation Mechanism (KCM) that comprises instance-level discrimination and cross-domain consistency alignment strategies to facilitate model adaptive learning of new knowledge from the current domain and promote knowledge consolidation learning between acquired domain-specific distributions, respectively. Experimental results show that our DAFC outperforms state-of-the-art methods. Our code is available at https://github.com/LiuShiBen/DAFC.
中文摘要:提出的分布感知遗忘补偿(DAFC)模型通过文本驱动提示聚合和领域分布感知,在不依赖旧样本的情况下增强跨领域表征学习并缓解灾难性遗忘,从而解决终身行人重识别的关键挑战。
English Summary: The proposed Distribution-aware Forgetting Compensation (DAFC) model addresses lifelong person re-identification challenges by integrating text-driven prompt aggregation and domain distribution awareness to enhance cross-domain representation learning while mitigating catastrophic forgetting without relying on old exemplars.

Authors:Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
Title: DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation
Abstract:
Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis. The code is released at https://github.com/XiaoBuL/DyST-XL.
Chinese: DyST-XL 是一种无需训练的框架,通过整合动态布局规划、双提示注意力控制和实体一致性约束,提升文本到视频模型在合成具有精确时空关系的复杂视频方面的性能。
English: DyST-XL is a training-free framework that enhances text-to-video models by integrating dynamic layout planning, dual-prompt attention control, and entity-consistency constraints to improve compositional video generation with precise spatial-temporal relationships.

Authors:Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie
Title: Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
中文摘要:本研究提出包含101万问题和33万图像的细粒度评估基准FG-BMK,通过对12个大型视觉语言模型的系统评估,揭示了训练范式对性能影响的关键发现,为未来模型发展提供了重要指导。
English Summary: This study introduces FG-BMK, a comprehensive fine-grained evaluation benchmark with 1.01 million questions and 0.33 million images, systematically assessing twelve LVLMs to reveal critical insights about training paradigms and performance limitations for advancing future model development.

Authors:Qianyu Zhu, Junjie Wang, Jeremiah Hu, Jia Ai, Yong Lee
Title: PIV-FlowDiffuser: Transfer-learning-based denoising diffusion models for PIV
Abstract:
Deep learning algorithms have significantly reduced the computational time and improved the spatial resolution of particle image velocimetry (PIV). However, models trained on synthetic datasets may perform worse on practical particle images due to domain gaps. As a result, special residual patterns are often observed in the vector fields of deep learning-based estimators. To reduce this special noise step by step, we employ a denoising diffusion model (FlowDiffuser) for PIV analysis. The data-hungry iterative denoising diffusion model is trained via a transfer learning strategy, resulting in our PIV-FlowDiffuser method. Specifically, we (1) pre-train a FlowDiffuser model on multiple optical flow datasets from the computer vision community, such as Sintel and KITTI, and (2) fine-tune the pre-trained model on synthetic PIV datasets. Note that the PIV images are upsampled by a factor of two to resolve the small-scale turbulent flow structures. The visualized results indicate that our PIV-FlowDiffuser effectively suppresses the noise patterns. As a result, the denoising diffusion model reduces the average end-point error (AEE) by 59.4% relative to the RAFT256-PIV baseline on the classic Cai's dataset. In addition, PIV-FlowDiffuser exhibits enhanced generalization performance on unseen particle images due to transfer learning. Overall, this study highlights transfer-learning-based denoising diffusion models for PIV. A detailed implementation is available for interested readers in the repository https://github.com/Zhu-Qianyu/PIV-FlowDiffuser.
中文: 本研究提出PIV-FlowDiffuser,一种基于迁移学习的去噪扩散模型,能有效降低粒子图像测速中的噪声,将平均端点误差减少59.4%,并提升对实际粒子图像的泛化性能。
English: This study introduces PIV-FlowDiffuser, a denoising diffusion model trained via transfer learning that effectively reduces noise in particle image velocimetry, cutting the average endpoint error by 59.4% and improving generalization on real-world particle images.
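The average end-point error quoted above is, in its usual optical-flow definition, the mean Euclidean distance between predicted and reference displacement vectors. A minimal sketch of that standard definition follows; the function name and the toy vectors are illustrative assumptions, not code or data from the PIV-FlowDiffuser repository.

```python
import math


def average_endpoint_error(pred, truth):
    """Mean Euclidean distance between predicted and reference flow vectors.

    pred, truth: equal-length sequences of (u, v) displacement pairs.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    total = 0.0
    for (u_p, v_p), (u_t, v_t) in zip(pred, truth):
        # End-point error of one vector: length of the difference vector.
        total += math.hypot(u_p - u_t, v_p - v_t)
    return total / len(pred)


# Toy example (hypothetical values): one exact vector, one off by (-3, 4).
pred = [(1.0, 0.0), (0.0, 2.0)]
truth = [(1.0, 0.0), (3.0, -2.0)]
aee = average_endpoint_error(pred, truth)
```

A relative reduction such as the reported 59.4% would then be computed as (AEE_baseline - AEE_method) / AEE_baseline.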

Authors:Geng Li, Jinglin Xu, Yunzhen Zhao, Yuxin Peng
Title: DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Abstract:
Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose Dyfo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches which require additional modules or data collection, Dyfo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content, without introducing additional training caused by vocabulary expansion or the integration of specialized localization modules. Experimental results demonstrate that Dyfo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and dynamic resolution models. The code is available at https://github.com/PKU-ICST-MIPL/DyFo_CVPR2025
中文:Dyfo是一种无需训练的视觉搜索方法,通过蒙特卡洛树搜索动态聚焦关键区域,无需额外模块或数据即可提升大型多模态模型的细粒度理解能力。
English: Dyfo is a training-free visual search method that enhances fine-grained understanding in large multimodal models by dynamically focusing on key regions using Monte Carlo Tree Search, improving performance without extra modules or data.

Authors:Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue
Title: OmniAudio: Generating Spatial Audio from 360-Degree Video
Abstract:
Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio-360V2SA.github.io.
Chinese: 传统视频转音频方法缺乏空间线索,本研究提出360V2SA新任务,通过OmniAudio框架从360度视频生成空间音频,在Sphere360数据集上取得了最优性能。
English: Traditional video-to-audio methods lack spatial cues, so this study introduces 360V2SA, a novel task using the OmniAudio framework to generate spatial audio from 360-degree videos, achieving state-of-the-art results on the Sphere360 dataset.

Authors:Yingming Zheng, Xiaoliang Liu, Peng Wu, Li Pan
Title: CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs
Abstract:
The rapid spread of misinformation, driven by digital media and AI-generated content, has made automatic claim verification essential. Traditional methods, which depend on expert-annotated evidence, are labor-intensive and not scalable. Although recent automated systems have improved, they still struggle with complex claims that require nuanced reasoning. To address this, we propose CRAVE, a Conflicting Reasoning Approach for explainable claim VErification, that verifies complex claims based on the conflicting rationales reasoned by large language models (LLMs). Specifically, CRAVE introduces a three-module framework. The Ambiguity Elimination enhanced Evidence Retrieval module performs ambiguity elimination and entity-based search to gather relevant evidence related to claim verification from external sources like Wikipedia. The Conflicting Perspective Reasoning and Preliminary Judgment module adopts LLMs to reason rationales with conflicting stances about claim verification from retrieved evidence across four dimensions, i.e., direct evidence, semantic relationships, linguistic patterns, and logical reasoning, and makes a preliminary judgment. Finally, the Small Language Model (SLM) based Judge module is fine-tuned to make use of the preliminary judgment from LLMs to assess the confidence of the conflicting rationales and make a final authenticity judgment. This methodology allows CRAVE to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification. Extensive experiments on two public claim verification datasets demonstrate that our CRAVE model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for finding relevant evidence and explaining the model predictions. The code is provided at https://github.com/8zym/CRAVE.
中文摘要:针对传统声明验证方法的局限性,我们提出CRAVE框架,利用大语言模型生成对立推理依据,并通过微调的小语言模型进行最终判定,显著提升了复杂声明验证的准确性与可解释性。
English Summary: To address the limitations of traditional claim verification methods, we propose CRAVE, a framework that leverages large language models to generate conflicting rationales and uses a fine-tuned small language model for final judgment, significantly improving accuracy and transparency in verifying complex claims.

Authors:Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, Yanwei Fu
Title: Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Abstract:
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
中文: Uni3C是一个统一的3D增强框架,通过即插即用的控制模块和联合对齐的3D世界引导,实现了视频生成中相机与人体运动的精准控制,在控制能力和运动质量上均显著优于现有方法。
English: Uni3C is a unified 3D-enhanced framework that achieves precise control of both camera and human motion in video generation through a plug-and-play control module and jointly aligned 3D world guidance, outperforming existing methods in controllability and motion quality.

Authors:Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Shuai Zhang, Yi Li
Title: Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions
Abstract:
This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and control method for multi-AUV task execution. Extensive experimental evaluations in the underwater data collection task demonstrate the system's operational feasibility, with quantitative results showing significant performance improvements over baseline methods. The proposed system exhibits robust coordination capabilities between USV and AUVs while maintaining stability in extreme sea conditions. To facilitate reproducibility and community advancement, we provide an open-source simulation toolkit available at https://github.com/360ZMEM/USV-AUV-colab.
中文: 本文提出了一种新型无人船与自主水下航行器协同系统,通过优化路径规划和强化学习控制方法,在极端海况下显著提升了水下任务性能,实验评估验证了其优越性。
English: This paper introduces a novel USV-AUV collaborative system that enhances underwater task performance in extreme conditions through optimized path planning and reinforcement learning-based control, demonstrating significant improvements in experimental evaluations.

Authors:Aihua Zheng, Yongqi Sun, Zi Wang, Chenglong Li, Jin Tang
Title: Collaborative Enhancement Network for Low-quality Multi-spectral Vehicle Re-identification
Abstract:
The performance of multi-spectral vehicle Re-identification (ReID) is significantly degraded when some important discriminative cues in visible, near infrared and thermal infrared spectra are lost. Existing methods generate or enhance missing details in low-quality spectra data using the high-quality one, generally called the primary spectrum, but how to justify the primary spectrum is a challenging problem. In addition, when the quality of the primary spectrum is low, the enhancement effect would be greatly degraded, thus limiting the performance of multi-spectral vehicle ReID. To address these problems, we propose the Collaborative Enhancement Network (CoEN), which generates a high-quality proxy from all spectra data and leverages it to supervise the selection of primary spectrum and enhance all spectra features in a collaborative manner, for robust multi-spectral vehicle ReID. First, to integrate the rich cues from all spectra data, we design the Proxy Generator (PG) to progressively aggregate multi-spectral features. Second, we design the Dynamic Quality Sort Module (DQSM), which sorts all spectra data by measuring their correlations with the proxy, to accurately select the primary spectra with the highest correlation. Finally, we design the Collaborative Enhancement Module (CEM) to effectively compensate for missing contents of all spectra by collaborating the primary spectra and the proxy, thereby mitigating the impact of low-quality primary spectra. Extensive experiments on three benchmark datasets are conducted to validate the efficacy of the proposed approach against other multi-spectral vehicle ReID methods. The codes will be released at https://github.com/yongqisun/CoEN.
Chinese: 提出的协同增强网络(CoEN)通过生成高质量代理来指导主光谱选择并协同增强所有光谱特征,有效应对多光谱车辆重识别中数据质量变化带来的挑战。
English: The proposed Collaborative Enhancement Network (CoEN) addresses multi-spectral vehicle ReID challenges by generating a high-quality proxy to guide primary spectrum selection and collaboratively enhance all spectral features, improving robustness against data quality variations.

Authors:Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon, Heeseung Yun, Sihaeng Lee, Aniruddha Kembhavi, Soonyoung Lee, Gunhee Kim, Sangho Lee, Christopher Clark
Title: ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Abstract:
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose a Relevance and Specificity-based online filtering framework (ReSpec) that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
Chinese: ReSpec框架通过基于模态对齐、任务相关性、特异性和处理效率实时筛选视频文本数据,显著提升了在线学习效率,在仅使用少量数据和计算的情况下,于零样本视频检索任务中实现了最优性能。
English: The ReSpec framework enhances online learning efficiency by filtering video-text data in real-time based on modality alignment, task relevance, specificity, and processing efficiency, achieving top performance on zero-shot retrieval tasks with minimal data and computation.

Authors:Yiming Luo, Yunfei Wang, Hongming Chen, Chengkai Wu, Ximin Lyu, Jinni Zhou, Jun Ma, Fu Zhang, Boyu Zhou
Title: FERMI: Flexible Radio Mapping with a Hybrid Propagation Model and Scalable Autonomous Data Collection
Abstract:
Communication is fundamental for multi-robot collaboration, with accurate radio mapping playing a crucial role in predicting signal strength between robots. However, modeling radio signal propagation in large and occluded environments is challenging due to complex interactions between signals and obstacles. Existing methods face two key limitations: they struggle to predict signal strength for transmitter-receiver pairs not present in the training set, while also requiring extensive manual data collection for modeling, making them impractical for large, obstacle-rich scenarios. To overcome these limitations, we propose FERMI, a flexible radio mapping framework. FERMI combines physics-based modeling of direct signal paths with a neural network to capture environmental interactions with radio signals. This hybrid model learns radio signal propagation more efficiently, requiring only sparse training data. Additionally, FERMI introduces a scalable planning method for autonomous data collection using a multi-robot team. By increasing parallelism in data collection and minimizing robot travel costs between regions, overall data collection efficiency is significantly improved. Experiments in both simulation and real-world scenarios demonstrate that FERMI enables accurate signal prediction and generalizes well to unseen positions in complex environments. It also supports fully autonomous data collection and scales to different team sizes, offering a flexible solution for creating radio maps. Our code is open-sourced at https://github.com/ymLuo1214/Flexible-Radio-Mapping.
中文:FERMI是一个灵活的无线电映射框架,通过结合基于物理的建模与神经网络,仅需稀疏数据即可在复杂环境中准确预测信号强度,同时其可扩展的自主数据收集方法显著提高了多机器人团队的效率。
English: FERMI is a flexible radio mapping framework that combines physics-based modeling with neural networks to accurately predict signal strength in complex environments using sparse data, while its scalable autonomous data collection method significantly improves efficiency for multi-robot teams.

Authors:Qiushi Xiong, Zhipeng Xu, Zhenghao Liu, Mengjia Wang, Zulong Chen, Yue Sun, Yu Gu, Xiaohua Li, Ge Yu
Title: Enhancing the Patent Matching Capability of Large Language Models via the Memory Graph
Abstract:
Intellectual Property (IP) management involves strategically protecting and utilizing intellectual assets to enhance organizational innovation, competitiveness, and value creation. Patent matching is a crucial task in intellectual property management, which facilitates the organization and utilization of patents. Existing models often rely on the emergent capabilities of Large Language Models (LLMs) and leverage them to identify related patents directly. However, these methods usually depend on matching keywords and overlook the hierarchical classification and categorical relationships of patents. In this paper, we propose MemGraph, a method that augments the patent matching capabilities of LLMs by incorporating a memory graph derived from their parametric memory. Specifically, MemGraph prompts LLMs to traverse their memory to identify relevant entities within patents, followed by attributing these entities to corresponding ontologies. After traversing the memory graph, we utilize the extracted entities and ontologies to improve the ability of LLMs to comprehend the semantics of patents. Experimental results on the PatentMatch dataset demonstrate the effectiveness of MemGraph, achieving a 17.68% performance improvement over baseline LLMs. Further analysis highlights the generalization ability of MemGraph across various LLMs, both in-domain and out-of-domain, and its capacity to enhance the internal reasoning processes of LLMs during patent matching. All data and codes are available at https://github.com/NEUIR/MemGraph.
中文: 本文提出MemGraph方法,通过利用记忆图谱识别专利中的实体和本体,显著提升大型语言模型在专利匹配中的性能,实验显示其比基线模型性能提高了17.68%。
English: This paper introduces MemGraph, a method that enhances patent matching in LLMs by using a memory graph to identify entities and ontologies, achieving a 17.68% performance improvement over baselines.

Authors:Ryu Tadokoro, Tsukasa Takagi, Shin-ichi Maeda
Title: Segmentation with Noisy Labels via Spatially Correlated Distributions
Abstract:
In semantic segmentation, the accuracy of models depends heavily on high-quality annotations. However, in many practical scenarios such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data includes label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. Bayesian inference requires computing the posterior distribution of label errors, which becomes intractable when spatial correlations are present. We represent the correlation of label errors between adjacent pixels through a Gaussian distribution whose covariance is structured by a Kac-Murdock-Szegö (KMS) matrix, solving the computational challenges. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.
Chinese Summary: 该研究提出了一种基于概率模型的贝叶斯估计方法,用于处理语义分割中空间相关的标注误差,在医学影像和遥感等任务中显著提升了模型性能。
English Summary: The study introduces a Bayesian estimation method using a probabilistic model that accounts for spatially correlated label errors in semantic segmentation, significantly enhancing model performance across tasks like medical imaging and remote sensing.
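For readers unfamiliar with the Kac-Murdock-Szegö structure mentioned in the abstract, a KMS matrix has entries rho^|i-j|, so correlation decays geometrically with the distance between indices. The sketch below builds a one-dimensional instance only; how the paper arranges this structure over a 2D pixel grid is not stated in the abstract, and the function name and the value of rho are assumptions for illustration.

```python
def kms_matrix(n, rho):
    """Return the n x n Kac-Murdock-Szego matrix with entries rho ** |i - j|."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Hypothetical example: 4 neighbouring positions with correlation 0.5 per step.
K = kms_matrix(4, 0.5)
for row in K:
    print(row)
```

The matrix is symmetric and Toeplitz, and for |rho| < 1 its inverse is known to be tridiagonal, which is one reason this structure is computationally convenient.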

Authors:Sirui Zeng, Xifeng Yan
Title: ADL: A Declarative Language for Agent-Based Chatbots
Abstract:
There are numerous frameworks capable of creating and orchestrating agents to address complex tasks. However, most of them tightly couple Python programming with agent declaration, making maintenance and runtime optimization difficult. In this work, we introduce ADL, an agent declarative language for customer service chatbots. ADL abstracts away implementation details, offering a declarative way to define agents and their interactions, which could ease maintenance and debugging. It also incorporates natural language programming at its core to simplify the specification and communication of chatbot designs. ADL includes four basic types of agents and supports integration with custom functions, tool use, and third-party agents. MICA, a multi-agent system designed to interpret and execute ADL programs, has been developed and is now available as an open-source project at https://github.com/Mica-labs/MICA. Its documentation can be found at https://mica-labs.github.io/.
中文: 本文提出ADL这一声明式语言,通过抽象实现细节简化客服聊天机器人中的智能体定义与交互,提升可维护性并融合自然语言编程核心功能。
English: This paper presents ADL, a declarative language that abstracts implementation details to simplify agent definition and interaction in customer service chatbots, enhancing maintainability and incorporating natural language programming.

Authors:Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang
Title: How Effective Can Dropout Be in Multiple Instance Learning?
Abstract:
Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. Nevertheless, dropout, the most commonly used technique for mitigating this issue, has not yet been explored in MIL. In this paper, we empirically explore how effective dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at https://github.com/ChongQingNoSubway/MILDropout.
Chinese: 本文提出MIL-Dropout新方法,通过选择性丢弃包内最重要的实例来提升多示例学习性能,在噪声环境下仍能增强泛化能力且计算成本可忽略。
English: This paper introduces MIL-Dropout, a novel method that improves Multiple Instance Learning (MIL) performance by selectively dropping top-k important instances, enhancing generalization even under noise with minimal computational cost.
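The key observation in the abstract, that removing the top-k most important instances from a bag can help, can be illustrated with a small sketch. This shows only that observation, not the MIL-Dropout method itself, whose selection rule the abstract does not describe; the importance scores, the function name, and the toy bag are all hypothetical.

```python
def drop_top_k(instances, scores, k):
    """Return the bag with its k highest-scoring instances removed.

    instances: list of instance features (any objects).
    scores: per-instance importance scores (assumed given, e.g. attention weights).
    """
    if len(instances) != len(scores):
        raise ValueError("instances and scores must have the same length")
    if k <= 0:
        return list(instances)
    # Indices sorted from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:k])
    # Keep the remaining instances in their original order.
    return [x for i, x in enumerate(instances) if i not in dropped]


# Toy bag with four instances; "b" and "d" carry the highest scores.
bag = ["a", "b", "c", "d"]
importance = [0.1, 0.9, 0.3, 0.7]
kept = drop_top_k(bag, importance, 2)
```

In a training loop, such dropping would presumably be applied only during training, as with ordinary dropout, so that the aggregator cannot rely on a few dominant instances.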

Authors:Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas, Yuki Mitsufuji
Title: DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions
Abstract:
This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets, comprising 70 tracks from MedleyDB and 365 tracks from a private collection. Analysis of parameter correlations reveals strong relationships between effects and parameters, such as the high-pass and low-shelf filters often working together to shape the low end, and the delay time correlating with the intensity of the delayed signals. Principal component analysis reveals connections to McAdams' timbre dimensions, where the most crucial component modulates the perceived spaciousness while the secondary components influence spectral brightness. Statistical testing confirms the non-Gaussian nature of the parameter distribution, highlighting the complexity of the vocal effects space. These initial findings on the parameter distributions set the foundation for future research in vocal effects modelling and automatic mixing. Our source code and datasets are accessible at https://github.com/SonyResearch/diffvox.
中文摘要:本研究提出了DiffVox这一可解释的声效匹配模型,通过可微分实现整合多种音频效果,实现了基于梯度的参数优化,并通过全面分析揭示了参数间的显著关联与音色维度关系。
English Summary: This research presents DiffVox, an interpretable model for vocal effects matching that combines multiple audio effects with differentiable implementations, enabling gradient-based optimization and revealing significant parameter correlations and timbral relationships through comprehensive analysis.

Authors:Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki
Title: TAPIP3D: Tracking Any Point in Persistent 3D Geometry
Abstract:
We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons. To handle the irregular structure of 3D point distributions, we propose a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism - a 3D-aware contextualization strategy that builds informative, spatially coherent feature neighborhoods to support precise trajectory estimation. Our 3D-centric formulation significantly improves performance over existing 3D point tracking methods and even surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. The model supports inference in both camera-centric (unstabilized) and world-centric (stabilized) coordinates, with experiments showing that compensating for camera motion leads to substantial gains in tracking robustness. By replacing the conventional 2D square correlation windows used in prior 2D and 3D trackers with a spatially grounded 3D attention mechanism, TAPIP3D achieves strong and consistent results across multiple 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
中文: TAPIP3D提出了一种新颖的3D点跟踪方法,通过深度和相机运动信息将视频特征稳定至三维空间,并采用3D邻域注意力机制实现长期鲁棒跟踪,在深度数据可靠时其性能超越现有3D与2D跟踪器。
English: TAPIP3D introduces a novel 3D point tracking method that stabilizes video features in 3D space using depth and camera motion, employing a 3D N2N attention mechanism to achieve robust long-term tracking and surpass both 3D and 2D trackers in performance when depth data is reliable.

Authors:Yeoreum Lee, Jinwook Jung, Sungyong Baik
Title: Mitigating Parameter Interference in Model Merging via Sharpness-Aware Fine-Tuning
Abstract:
Large-scale deep learning models with a pretraining-finetuning paradigm have led to a surge of numerous task-specific models fine-tuned from a common pre-trained model. Recently, several research efforts have been made on merging these large models into a single multi-task model, particularly with simple arithmetic on parameters. Such merging methodology faces a central challenge: interference between model parameters fine-tuned on different tasks. A few recent works have focused on designing a new fine-tuning scheme that leads to small parameter interference, but at the cost of the performance of each task-specific fine-tuned model, thereby limiting that of the merged model. To improve the performance of a merged model, we note that a fine-tuning scheme should aim for (1) smaller parameter interference and (2) better performance of each fine-tuned model on the corresponding task. In this work, we aim to design a new fine-tuning objective function to work towards these two goals. In the course of this process, we find this objective function to be strikingly similar to the sharpness-aware minimization (SAM) objective function, which aims to achieve generalization by finding flat minima. Drawing upon our observation, we propose to fine-tune pre-trained models via sharpness-aware minimization. The experimental and theoretical results showcase the effectiveness and orthogonality of our proposed approach, improving performance upon various merging and fine-tuning methods. Our code is available at https://github.com/baiklab/SAFT-Merge.
Chinese Summary: 本研究提出了一种利用锐度感知最小化的新微调方法,通过减少参数干扰和提升各任务模型的性能,有效增强了合并多任务模型的整体表现。
English Summary: This research introduces a new fine-tuning method using sharpness-aware minimization to enhance the performance of merged multi-task models by reducing parameter interference and improving task-specific model efficacy.
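
For readers unfamiliar with sharpness-aware minimization, the generic SAM update can be sketched on a toy quadratic loss. This is the standard two-gradient SAM step, not the paper's fine-tuning objective or merging pipeline; the loss, the `curvature` vector, and the hyperparameters `lr` and `rho` are assumptions chosen for illustration.

```python
import math


def loss(w, curvature):
    """Toy quadratic loss: 0.5 * sum(a_i * w_i^2)."""
    return 0.5 * sum(a * x * x for a, x in zip(curvature, w))


def grad(w, curvature):
    """Gradient of the toy quadratic loss."""
    return [a * x for a, x in zip(curvature, w)]


def sam_step(w, curvature, lr=0.05, rho=0.05):
    """One SAM update.

    1. Compute the gradient g at w.
    2. Move to the approximate worst-case point w + rho * g / ||g||.
    3. Take the descent step from w using the gradient at that point.
    """
    g = grad(w, curvature)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    g_adv = grad(w_adv, curvature)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


w0 = [1.0, 1.0]
curvature = [10.0, 1.0]  # one sharp direction, one flat direction
w1 = sam_step(w0, curvature)
```

The only difference from plain gradient descent is that the descent direction is evaluated at the perturbed point `w_adv`, which penalizes directions where the loss rises quickly inside a radius-`rho` neighborhood.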

Authors:Binjie Guo, Hanyu Zheng, Guowei Su, Ru Zhang, Haohan Jiang, Xurong Lin, Hongyan Wei, Aisheng Mo, Jie Li, Zhiyuan Qian, Zhuhao Zhang, Xiaoyuan Cheng
Title: AlphaZero-Edu: Making AlphaZero Accessible to Everyone
Abstract:
Recent years have witnessed significant progress in reinforcement learning, especially with Zero-like paradigms, which have greatly boosted the generalization and reasoning abilities of large-scale language models. Nevertheless, existing frameworks are often plagued by high implementation complexity and poor reproducibility. To tackle these challenges, we present AlphaZero-Edu, a lightweight, education-focused implementation built upon the mathematical framework of AlphaZero. It boasts a modular architecture that disentangles key components, enabling transparent visualization of the algorithmic processes. Additionally, it is optimized for resource-efficient training on a single NVIDIA RTX 3090 GPU and features highly parallelized self-play data generation, achieving a 3.2-fold speedup with 8 processes. In Gomoku matches, the framework has demonstrated exceptional performance, achieving a consistently high win rate against human opponents. AlphaZero-Edu has been open-sourced at https://github.com/StarLight1212/AlphaZero_Edu, providing an accessible and practical benchmark for both academic research and industrial applications.
Chinese: AlphaZero-Edu 是一个轻量级模块化框架,通过简化实现并在有限硬件上实现高效训练,提升了强化学习在教育领域的应用,在五子棋等游戏中表现出色。
English: AlphaZero-Edu is a lightweight, modular framework that enhances reinforcement learning for education by simplifying implementation and enabling efficient training on limited hardware, achieving high performance in games like Gomoku.

Authors:Haiyan Qin, Jiahao Feng, Xiaotong Feng, Wei W. Xing, Wang Kang
Title: Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence
Abstract:
Large language models (LLMs) have transformed code generation, yet their application in hardware design produces gate counts 38\%--1075\% higher than human designs. We present CircuitMind, a multi-agent framework that achieves human-competitive efficiency through three key innovations: syntax locking (constraining generation to basic logic gates), retrieval-augmented generation (enabling knowledge-driven design), and dual-reward optimization (balancing correctness with efficiency). To evaluate our approach, we introduce TC-Bench, the first gate-level benchmark harnessing collective intelligence from the TuringComplete ecosystem -- a competitive circuit design platform with hundreds of thousands of players. Experiments show CircuitMind enables 55.6\% of model implementations to match or exceed top-tier human experts in composite efficiency metrics. Most remarkably, our framework elevates the 14B Phi-4 model to outperform both GPT-4o mini and Gemini 2.0 Flash, achieving efficiency comparable to the top 25\% of human experts without requiring specialized training. These innovations establish a new paradigm for hardware optimization where collaborative AI systems leverage collective human expertise to achieve optimal circuit designs. Our model, data, and code are open-source at https://github.com/BUAA-CLab/CircuitMind.
中文: CircuitMind通过语法锁定、检索增强生成和双奖励优化三大创新,构建了一个多智能体框架,使AI系统无需专门训练即可在电路设计效率上达到或超越顶尖人类专家的水平。
English: CircuitMind introduces a multi-agent framework that achieves human-competitive circuit design efficiency through syntax locking, retrieval-augmented generation, and dual-reward optimization, enabling AI systems to match or surpass top-tier human experts without specialized training.

Authors:Zhenkui Yang, Zeyi Huang, Ge Wang, Han Ding, Tony Xiao Han, Fei Wang
Title: Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts
Abstract:
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies (label-only, brief description, and detailed action description) without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our code is available at https://github.com/yangzhenkui/WiTalk.
中文:提出的WiTalk框架通过集成语义文本提示来增强无线人体感知,无需修改架构或增加数据成本,即在多个数据集上显著提升了识别准确率。
English: The proposed WiTalk framework enhances wireless human sensing by integrating semantic text prompts, significantly improving accuracy across multiple datasets without architectural changes or extra data costs.

Authors:Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou
Title: MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
Abstract:
Human instance matting aims to estimate an alpha matte for each human instance in an image, which is challenging as it easily fails in complex cases requiring disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level. Specifically, we first build feature-level multiplane representations to split the scene into multiple planes based on depth differences. This approach makes the scene representation 3D-aware, and can serve as an effective clue for splitting instances in different 3D positions, thereby improving interpretability and boundary handling ability especially in occlusion areas. Then, we introduce another multiplane representation that splits the scene in an instance-level perspective, and represents each instance with both matte and color. We also treat background as a special instance, which is often overlooked by existing methods. Such an instance-level representation facilitates both foreground and background content awareness, and is useful for other down-stream tasks like image editing. Once built, the representation can be reused to realize controllable instance-level image editing with high efficiency. Extensive experiments validate the clear advantage of MP-Mat in matting task. We also demonstrate its superiority in image editing tasks, an area under-explored by existing matting-focused methods, where our approach under zero-shot inference even outperforms trained specialized image editing techniques by large margins. Code is open-sourced at https://github.com/JiaoSiyi/MPMat.git.
中文: MP-Mat通过结合三维感知和实例级别的多重平面表示框架,有效提升了复杂场景下人体实例抠图的精确度,在抠图和图像编辑任务中均展现出卓越性能。
English: MP-Mat introduces a dual multiplane framework combining 3D-aware and instance-level representations to improve human instance matting accuracy in complex scenarios, demonstrating superior performance in both matting and image editing tasks.

Authors:Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: UFO2: The Desktop AgentOS
Abstract:
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI--API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
中文:UFO2提出了一种面向Windows的多智能体操作系统,通过深度系统集成、专用代理和混合控制检测技术,显著提升了桌面自动化的鲁棒性和执行精度,优于现有系统。
English: UFO2 introduces a multiagent AgentOS for Windows that enhances desktop automation through deep OS integration, specialized agents, and hybrid control detection, significantly improving robustness and execution accuracy over previous systems.

Authors:Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Jianxing Zhang, Jinlong Wu, Jun Wang, Zheng Xie, Hakjae Jeon, Suejin Han, Hyung-Ju Chun, Hyunhee Park, Zhicun Yin, Junjie Chen, Ming Liu, Xiaoming Li, Chao Zhou, Wangmeng Zuo, Weixia Zhang, Dingquan Li, Kede Ma, Yun Zhang, Zhuofan Zheng, Yuyue Liu, Shizhen Tang, Zihao Zhang, Yi Ning, Hao Jiang, Wenjie An, Kangmeng Yu, Chenyang Wang, Kui Jiang, Xianming Liu, Junjun Jiang, Yingfu Zhang, Gang He, Siqi Wang, Kepeng Xu, Zhenyang Liu, Changxin Zhou, Shanlan Shen, Yubo Duan, Yiang Chen, Jin Guo, Mengru Yang, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu, Hu Peng, Chunming He
Title: NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results
Abstract:
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
中文: 本文综述了NTIRE 2025真实人脸复原挑战赛,重点介绍了提升感知质量和身份一致性的解决方案,共有13支团队提交模型,其中10支在最终排名中取得有效成绩。
English: This paper reviews the NTIRE 2025 challenge on real-world face restoration, focusing on solutions that enhance perceptual quality and identity consistency, with 13 teams submitting models and 10 achieving valid rankings.

Authors:Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu
Title: Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Abstract:
Building a generalizable self-correction system is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with semantic reflection ability for failure, translating semantic reflection into how to correct fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding how to correct fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we could shift the demand for generalization capability from the low-level manipulation policy to the MLLMs-driven motion adjustment model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a lifelong learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks. Our code is released at https://github.com/GeWu-Lab/Motion-based-Self-Reflection-Framework.
中文: Phoenix框架通过运动指令连接语义反思与机器人动作修正,采用双流程调整机制和扩散策略实现精确控制,并通过终身学习提升在动态环境中的适应能力。
English: The Phoenix framework bridges semantic reflection and robotic action correction through motion instruction, enabling precise adjustments via a dual-process mechanism and diffusion policy, with lifelong learning enhancing its adaptability in diverse environments.

Authors:Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen
Title: NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results
Abstract:
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
中文摘要:NTIRE 2025图像超分辨率挑战赛旨在通过双赛道设计(恢复赛道与感知赛道)推动超分辨率技术发展,共有25支团队提交有效方案,该竞赛成果将成为领域重要基准。
English Summary: The NTIRE 2025 image super-resolution challenge at CVPR 2025 seeks advanced solutions for reconstructing high-resolution images from downsampled inputs, featuring dual evaluation tracks for restoration accuracy and perceptual quality with 25 team submissions.

Authors:Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang
Title: NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
Abstract:
Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag
中文摘要:NoWag框架通过向量量化和剪枝技术高效压缩Llama-2与Llama-3等大语言模型,其性能超越现有方法并揭示了不同压缩范式间的内在关联。
English Summary: The NoWag framework effectively compresses large language models like Llama-2 and Llama-3 through vector quantization and pruning, outperforming existing methods and revealing commonalities in compression techniques.

Authors:Haiyan Qin, Zhiwei Xie, Jingjing Li, Liangchen Li, Xiaotong Feng, Junzhan Liu, Wang Kang
Title: ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model
Abstract:
Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three complementary innovations: (1) ReasoningV-5K, a high-quality dataset of 5,000 functionally verified instances with reasoning paths created through multi-dimensional filtering of PyraNet samples; (2) a two-stage training approach combining parameter-efficient fine-tuning for foundational knowledge with full-parameter optimization for enhanced reasoning; and (3) an adaptive reasoning mechanism that dynamically adjusts reasoning depth based on problem complexity, reducing token consumption by up to 75\% while preserving performance. Experimental results demonstrate ReasoningV's effectiveness with a pass@1 accuracy of 57.8\% on VerilogEval-human, achieving performance competitive with leading commercial models like Gemini-2.0-flash (59.5\%) and exceeding the previous best open-source model by 10.4 percentage points. ReasoningV offers a more reliable and accessible pathway for advancing AI-driven hardware design automation, with our model, data, and code available at https://github.com/BUAA-CLab/ReasoningV.
中文:本文提出ReasoningV模型,通过高质量数据集、两阶段训练和自适应推理机制,在Verilog代码生成中实现与主流商业模型相媲美的性能,同时显著降低计算成本。
English: This paper introduces ReasoningV, a hybrid reasoning model for Verilog code generation that combines a high-quality dataset, two-stage training, and adaptive reasoning to achieve competitive performance with leading commercial models while reducing computational costs.

Authors:Liang Peng, Boxi Wu, Haoran Cheng, Yibo Zhao, Xiaofei He
Title: SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization
Abstract:
Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes the mean squared error (MSE) loss at the pixel level, neglecting the need for global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the need for costly data collection and annotation efforts typically associated with traditional direct preference optimization methods. Through extensive experiments on widely-used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The code is provided at https://github.com/SPengLiang/SUDO.
中文摘要:本文提出SUDO方法,通过自监督方式优化文本到图像扩散模型的像素级细节和全局图像质量,无需昂贵数据标注即可显著提升图像结构连贯性。
English Summary: This paper introduces SUDO, a self-supervised method that optimizes both pixel-level details and global image quality in text-to-image diffusion models, eliminating the need for costly data annotation while significantly enhancing image coherence.
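
As background for the direct preference optimization component, the generic DPO loss for a single preference pair can be sketched as below. This is the standard log-likelihood form of DPO, shown under stated assumptions: the abstract does not say how SUDO constructs its self-supervised pairs or adapts the objective to diffusion models, and the function name, argument names, and `beta` value here are hypothetical.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose).
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no preference margin the loss equals log(2); it falls as the policy
# assigns relatively more probability to the preferred sample.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The reference-model terms keep the fine-tuned model from drifting arbitrarily: only the change in log-probability relative to the reference contributes to the margin.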

Authors:Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, Feng Guo
Title: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
Abstract:
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.
Chinese Summary: 视觉大语言模型在通用视觉任务中表现出色,但在自动驾驶等安全关键领域的应用仍存局限;新推出的DVBench基准测试揭示了现有模型在复杂驾驶场景理解上的不足,并通过领域微调显著提升了模型性能,为开发符合实际安全要求的视觉大语言模型提供了重要评估框架。
English Summary: Vision Large Language Models (VLLMs) show strong performance in general visual tasks but struggle with safety-critical autonomous driving scenarios, as demonstrated by the new DVBench benchmark which revealed significant performance gaps and the need for domain-specific fine-tuning to improve their applicability.

Authors:Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng
Title: DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue
Abstract:
Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
中文:DialogueAgents框架通过三个专业代理协同生成富有表现力的多样化语音对话,创建了高质量的MultiTalk数据集,有效解决了现有数据集成本高、多样性不足的问题。
English: The DialogueAgents framework uses three specialized agents to collaboratively generate expressive, diverse speech dialogues, producing the high-quality MultiTalk dataset and addressing limitations of costly and limited existing datasets.

Authors:Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, Boyu Zhou
Title: ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion
Abstract:
Navigating unknown environments to find a target object is a significant challenge. While semantic information is crucial for navigation, relying solely on it for decision-making may not always be efficient, especially in environments with weak semantic cues. Additionally, many methods are susceptible to misdetections, especially in environments with visually similar objects. To address these limitations, we propose ApexNav, a zero-shot object navigation framework that is both more efficient and reliable. For efficiency, ApexNav adaptively utilizes semantic information by analyzing its distribution in the environment, guiding exploration through semantic reasoning when cues are strong, and switching to geometry-based exploration when they are weak. For reliability, we propose a target-centric semantic fusion method that preserves long-term memory of the target and similar objects, enabling robust object identification even under noisy detections. We evaluate ApexNav on the HM3Dv1, HM3Dv2, and MP3D datasets, where it outperforms state-of-the-art methods in both SR and SPL metrics. Comprehensive ablation studies further demonstrate the effectiveness of each module. Furthermore, real-world experiments validate the practicality of ApexNav in physical environments. The code will be released at https://github.com/Robotics-STAR-Lab/ApexNav.
中文: ApexNav是一种零样本目标导航框架,通过自适应地利用语义和几何探索提高效率,并采用以目标为中心的语义融合增强可靠性,在多个数据集上超越了现有最优方法。
English: ApexNav is a zero-shot object navigation framework that enhances efficiency by adaptively using semantic and geometric exploration and improves reliability through target-centric semantic fusion, outperforming state-of-the-art methods across multiple datasets.

Authors:Mingya Zhang, Liang Wang, Limei Gu, Tingsheng Ling, Xianping Tao
Title: WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
Abstract:
Semi-supervised medical image segmentation (SSMIS) shows promise in reducing reliance on scarce labeled medical data. However, the SSMIS field confronts challenges such as distribution mismatches between labeled and unlabeled data, artificial perturbations causing training biases, and inadequate use of raw image information, especially low-frequency (LF) and high-frequency (HF) components. To address these challenges, we propose a Wavelet Transform based Bidirectional Copy-Paste SSMIS framework, named WT-BCP, which improves upon the Mean Teacher approach. Our method enhances unlabeled data understanding by copying random crops between labeled and unlabeled images and employs WT to extract LF and HF details. We propose a multi-input and multi-output model named XNet-Plus, to receive the fused information after WT. Moreover, consistency training among multiple outputs helps to mitigate learning biases introduced by artificial perturbations. During consistency training, the mixed images resulting from WT are fed into both models, with the student model's output being supervised by pseudo-labels and ground-truth. Extensive experiments conducted on 2D and 3D datasets confirm the effectiveness of our model. Code: https://github.com/simzhangbest/WT-BCP.
中文:WT-BCP框架通过结合小波变换提取频率细节和双向复制粘贴提升数据利用,有效解决了半监督医学图像分割中的分布不匹配和训练偏差问题。
English: The WT-BCP framework enhances semi-supervised medical image segmentation by integrating wavelet transform for frequency detail extraction and bidirectional copy-paste to improve data utilization, effectively addressing distribution mismatches and training biases.
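
As a rough illustration of the LF/HF split that the abstract attributes to the wavelet transform, the sketch below applies a single-level 2D Haar transform with NumPy. The choice of the Haar wavelet, the function name `haar_dwt2`, and the toy inputs are assumptions made for illustration; the abstract does not say which wavelet WT-BCP uses or how the bands are fused.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform.

    Returns the low-frequency band and a tuple of three
    high-frequency (detail) bands, each half the input size.
    """
    img = np.asarray(img, dtype=float)
    if img.shape[0] % 2 or img.shape[1] % 2:
        raise ValueError("both image sides must be even")
    # The four pixels of every non-overlapping 2x2 block.
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    low = (a + b + c + d) / 2.0        # local average (LF)
    detail_1 = (a - b + c - d) / 2.0   # differences across columns
    detail_2 = (a + b - c - d) / 2.0   # differences across rows
    detail_3 = (a - b - c + d) / 2.0   # diagonal differences
    return low, (detail_1, detail_2, detail_3)
```

A flat image has no detail content, so all three detail bands come out as zero, and because the transform is orthonormal the total energy of the four bands equals the energy of the input.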

Authors:Chuhao Liu, Zhijian Qiao, Jieqi Shi, Ke Wang, Peize Liu, Shaojie Shen
Title: SG-Reg: Generalizable and Efficient Scene Graph Registration
Abstract:
This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, or the ground-truth annotation reliance in learning-based scene graph registration, impede their application in practical real-world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to decide transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg.
中文: 本文提出一种新型场景图网络,通过融合多模态语义特征实现刚性语义场景图的高效鲁棒配准,在显著提升配准成功率的同时,仅需极少的GPU资源和通信带宽。
English: This paper introduces a novel scene graph network that fuses multimodal semantic features for efficient and robust registration of rigid semantic scene graphs, significantly outperforming existing methods in success rate while requiring minimal GPU resources and communication bandwidth.

Authors:Qiang Chen, Xiao Wang, Haowen Wang, Bo Jiang, Lin Zhu, Dawei Zhang, Yonghong Tian, Jin Tang
Title: Adversarial Attack for RGB-Event based Visual Object Tracking
Abstract:
Visual object tracking is a crucial research topic in the fields of computer vision and multi-modal fusion. Among various approaches, robust visual tracking that combines RGB frames with Event streams has attracted increasing attention from researchers. While striving for high accuracy and efficiency in tracking, it is also important to explore how to effectively conduct adversarial attacks and defenses on RGB-Event stream tracking algorithms, yet research in this area remains relatively scarce. To bridge this gap, in this paper, we propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking. Because of the diverse representations of Event streams, and given that Event voxels and frames are more commonly used, this paper focuses on these two representations for an in-depth study. Specifically, for the RGB-Event voxel, we first optimize the perturbation by adversarial loss to generate RGB frame adversarial examples. For discrete Event voxel representations, we propose a two-step attack strategy: we first inject Event voxels into the target region as initialized adversarial examples, then conduct a gradient-guided optimization by perturbing the spatial location of the Event voxels. For the RGB-Event frame based tracking, we optimize the cross-modal universal perturbation by integrating the gradient information from multimodal data. We evaluate the proposed approach against attacks on three widely used RGB-Event Tracking datasets, i.e., COESOT, FE108, and VisEvent. Extensive experiments show that our method significantly reduces the performance of the tracker across numerous datasets in both unimodal and multimodal scenarios. The source code will be released at https://github.com/Event-AHU/Adversarial_Attack_Defense
中文: 本文提出了一种针对RGB-Event视觉跟踪的跨模态对抗攻击算法,通过对事件体素和帧表示分别采用扰动优化策略生成对抗样本,在多数据集实验中显著降低了跟踪器的性能。
English: This paper introduces a cross-modal adversarial attack algorithm targeting RGB-Event visual tracking, which effectively degrades tracker performance by generating adversarial examples for both RGB-Event voxel and frame representations through optimized perturbation strategies.

Authors:Xiang Zhang, Yongfeng Zhang
Title: Planet as a Brain: Towards Internet of AgentSites based on AIOS Server
Abstract:
The internet is undergoing a historical transformation from the "Internet of Websites" to the "Internet of AgentSites." While traditional Websites served as the foundation for information hosting and dissemination, a new frontier is emerging where AgentSites serve as the hubs of the internet, where each AgentSite hosts one or more AI agents that receive tasks, address them, and deliver actionable solutions, marking a significant shift in the digital landscape and representing the next generation of online ecosystems. Under this vision, AIOS, the AI Agent Operating System, serves as the server for the development, deployment and execution of AI agents, which is a fundamental infrastructure for the Internet of AgentSites. In this paper, we introduce AIOS Server, a runtime framework to host agents and enable global-scale collaboration among decentralized agents. AIOS Server provides a communication protocol leveraging the Model Context Protocol (MCP) and JSON-RPC to enable agent-agent or human-agent interactions. Each AIOS node operates as a server to host and execute agents, while supporting peer-to-peer coordination without reliance on centralized orchestration. Based on AIOS Server, we further present the world's first practically deployed Internet of AgentSites (AIOS-IoA), including AgentHub for agent registration and discovery and AgentChat for interactive communication, at https://planet.aios.foundation. The agent discovery mechanism based on Distributed Hash Tables (DHT) and a Gossip protocol serves as the search engine for the Internet of AgentSites. This work provides a practical foundation for building the Internet of AgentSites, a new paradigm where autonomous agents become first-class citizens of the web. The implementation is available at https://github.com/agiresearch/AIOS.Server and is integrated into the AIOS main branch at https://github.com/agiresearch/AIOS.
中文摘要:互联网正从“网站互联网”向“智能体站点互联网”演进,AIOS作为底层操作系统通过MCP和JSON-RPC协议实现分布式AI智能体的任务执行与协同交互,并已部署首个实践平台AIOS-IoA。
English Summary: The internet is evolving from a "Websites" model to an "AgentSites" paradigm, where AIOS serves as the operating system enabling decentralized AI agents to perform tasks and collaborate globally through protocols like MCP and JSON-RPC.

Authors:Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo
Title: SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation
Abstract:
The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available at https://github.com/pmh9960/SphereDiff
中文: SphereDiff提出了一种新颖的球形潜在表示和采样方法,无需额外调优即可直接使用预训练扩散模型生成高质量360度全景图像和视频,有效解决了等距柱状投影带来的失真问题。
English: SphereDiff introduces a novel spherical latent representation and sampling method that enables high-fidelity 360-degree panoramic image and video generation using pretrained diffusion models without additional tuning, effectively mitigating distortions from equirectangular projection.

Authors:Mohammed Ayman Shalaby, Syed Shabbir Ahmed, Nicholas Dahdah, Charles Champagne Cossette, Jerome Le Ny, James Richard Forbes
Title: MILUV: A Multi-UAV Indoor Localization dataset with UWB and Vision
Abstract:
This paper introduces MILUV, a Multi-UAV Indoor Localization dataset with UWB and Vision measurements. This dataset comprises 217 minutes of flight time over 36 experiments using three quadcopters, collecting ultra-wideband (UWB) ranging data such as the raw timestamps and channel-impulse response data, vision data from a stereo camera and a bottom-facing monocular camera, inertial measurement unit data, height measurements from a laser rangefinder, magnetometer data, and ground-truth poses from a motion-capture system. The UWB data is collected from up to 12 transceivers affixed to mobile robots and static tripods in both line-of-sight and non-line-of-sight conditions. The UAVs fly at a maximum speed of 4.418 m/s in an indoor environment with visual fiducial markers as features. MILUV is versatile and can be used for a wide range of applications beyond localization, but the primary purpose of MILUV is for testing and validating multi-robot UWB- and vision-based localization algorithms. The dataset can be downloaded at https://doi.org/10.25452/figshare.plus.28386041.v1. A development kit is presented alongside the MILUV dataset, which includes benchmarking algorithms such as visual-inertial odometry, UWB-based localization using an extended Kalman filter, and classification of CIR data using machine learning approaches. The development kit can be found at https://github.com/decargroup/miluv, and is supplemented with a website available at https://decargroup.github.io/miluv/.
中文: 本文介绍了MILUV,这是一个包含超宽带、视觉和惯性数据的多无人机室内定位数据集,基于36次实验采集,主要用于测试多机器人定位算法,并提供了包含基准算法的开发工具包。
English: This paper presents MILUV, a comprehensive indoor localization dataset for multiple UAVs featuring UWB, vision, and inertial data collected from 36 experiments, designed primarily for testing multi-robot localization algorithms and available with a development kit for benchmarking.

Authors:Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna
Title: Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
Abstract:
Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
Chinese: 提出的FLOW方法和FlexCiM架构通过实现分层稀疏度优化和灵活硬件部署,克服了大语言模型中固定N:M稀疏性的限制,在准确率、延迟和能耗方面均取得显著提升。
English: The proposed FLOW method and FlexCiM architecture overcome the limitations of fixed N:M sparsity in large language models by enabling layer-wise sparsity optimization and flexible hardware deployment, achieving significant improvements in accuracy, latency, and energy efficiency.
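
To make the N:M pattern concrete, the following sketch performs plain magnitude-based N:M pruning in NumPy. It assumes the common convention that N:M means keeping the N largest-magnitude weights in every group of M consecutive weights; the function name `prune_n_m` is hypothetical, and the sketch does not implement FLOW's outlier-aware, layer-wise selection of N and M.

```python
import numpy as np

def prune_n_m(weights, n, m):
    """Zero all but the n largest-magnitude values in each group of m
    consecutive weights along the last axis."""
    w = np.asarray(weights, dtype=float)
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if w.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    if n == m:
        return w.copy()  # nothing to prune
    groups = w.reshape(-1, m)
    # Column indices of the (m - n) smallest-magnitude entries per group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)
```

With a 2:4 pattern, for example, each group of four weights keeps its two largest-magnitude entries. Allowing N and M to differ per layer, as the abstract describes, would amount to calling such a routine with different `(n, m)` pairs for different layers.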

Authors:Youngbin Lee, Yejin Kim, Suin Kim, Yongjae Lee
Title: Integrating LLM-Generated Views into Mean-Variance Optimization Using the Black-Litterman Model
Abstract:
Portfolio optimization faces challenges due to the sensitivity in traditional mean-variance models. The Black-Litterman model mitigates this by integrating investor views, but defining these views remains difficult. This study explores the integration of views generated by large language models (LLMs) into portfolio optimization using the Black-Litterman framework. Our method leverages LLMs to estimate expected stock returns from historical prices and company metadata, incorporating uncertainty through the variance in predictions. We conduct a backtest of the LLM-optimized portfolios from June 2024 to February 2025, rebalancing biweekly using the previous two weeks of price data. As baselines, we compare against the S&P 500, an equal-weighted portfolio, and a traditional mean-variance optimized portfolio constructed using the same set of stocks. Empirical results suggest that different LLMs exhibit varying levels of predictive optimism and confidence stability, which impact portfolio performance. The source code and data are available at https://github.com/youngandbin/LLM-MVO-BLM.
中文: 本研究将大型语言模型融入Black-Litterman框架,通过生成投资者观点优化投资组合,回测结果表明不同模型的预测乐观程度和置信稳定性会影响组合表现,与传统方法形成对比。
English: This study integrates large language models (LLMs) into the Black-Litterman framework to generate investor views for portfolio optimization, demonstrating through backtesting that different LLMs' predictive optimism and confidence stability influence portfolio performance compared to traditional methods.
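
For readers unfamiliar with how views enter the Black-Litterman model, here is a minimal sketch of the standard posterior-mean formula in NumPy. The function name, the default `tau=0.05`, and the idea of filling `q` with average LLM return predictions and `omega` with their variances are illustrative assumptions based on the abstract's wording, not the authors' implementation.

```python
import numpy as np

def black_litterman_posterior(pi, sigma, P, q, omega, tau=0.05):
    """Posterior expected returns under the standard Black-Litterman model.

    pi    : prior (equilibrium) expected returns, shape (n,)
    sigma : asset return covariance, shape (n, n)
    P     : view pick matrix, shape (k, n)
    q     : view returns, shape (k,)
    omega : view uncertainty covariance, shape (k, k)
    tau   : scalar scaling the prior covariance (assumed value)
    """
    pi = np.asarray(pi, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    P = np.atleast_2d(np.asarray(P, dtype=float))
    q = np.asarray(q, dtype=float)
    omega = np.atleast_2d(np.asarray(omega, dtype=float))

    prior_precision = np.linalg.inv(tau * sigma)
    view_precision = np.linalg.inv(omega)
    # Precision-weighted combination of prior and views.
    A = prior_precision + P.T @ view_precision @ P
    b = prior_precision @ pi + P.T @ view_precision @ q
    return np.linalg.solve(A, b)
```

When a view is as certain as the prior, the posterior lands halfway between the two; as `omega` grows, the posterior falls back to the prior `pi`. The resulting vector would then feed a mean-variance optimizer in place of raw historical means.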

Authors:Ionut-Gabriel Farcas, Rayomand P. Gundevia, Ramakanth Munipalli, Karen E. Willcox
Title: A parallel implementation of reduced-order modeling of large-scale systems
Abstract:
Motivated by the large-scale nature of modern aerospace engineering simulations, this paper presents a detailed description of distributed Operator Inference (dOpInf), a recently developed parallel algorithm designed to efficiently construct physics-based reduced-order models (ROMs) for problems with large state dimensions. One such example is the simulation of rotating detonation rocket engines, where snapshot data generated by high-fidelity large-eddy simulations have many millions of degrees of freedom. dOpInf enables, via distributed computing, the efficient processing of datasets with state dimensions that are too large to process on a single computer, and the learning of structured physics-based ROMs that approximate the dynamical systems underlying those datasets. All elements of dOpInf are scalable, leading to a fully parallelized reduced modeling approach that can scale to the thousands of processors available on leadership high-performance computing platforms. The resulting ROMs are computationally cheap, making them ideal for key engineering tasks such as design space exploration, risk assessment, and uncertainty quantification. To illustrate the practical application of dOpInf, we provide a step-by-step tutorial using a 2D Navier-Stokes flow over a step scenario as a case study. This tutorial guides users through the implementation process, making dOpInf accessible for integration into complex aerospace engineering simulations.
中文: 本文介绍了分布式算子推断(dOpInf)这一可扩展并行算法,用于高效构建基于物理的降阶模型以处理大规模航空航天仿真,适用于设计空间探索和不确定性量化等工程任务。
English: This paper introduces distributed Operator Inference (dOpInf), a scalable parallel algorithm for constructing physics-based reduced-order models to handle large-scale aerospace simulations efficiently, enabling applications like design exploration and uncertainty quantification.
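
The core idea behind Operator Inference, projecting snapshots onto a low-dimensional basis and then fitting reduced operators by least squares, can be sketched in a few lines. The version below is a deliberately simplified assumption: it is serial, discrete-time, and fits only a linear reduced operator, and the name `fit_linear_rom` is hypothetical. None of dOpInf's distributed processing is shown.

```python
import numpy as np

def fit_linear_rom(snapshots, r):
    """Fit a rank-r linear reduced model x_{k+1} ~= A_hat x_k.

    snapshots : array of shape (n_state, n_time), columns ordered in time
    r         : reduced dimension
    Returns the basis V (n_state, r) and the reduced operator A_hat (r, r).
    """
    X = np.asarray(snapshots, dtype=float)
    # Basis from the leading left singular vectors (POD).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    V = U[:, :r]
    X_red = V.T @ X                      # project snapshots
    X0, X1 = X_red[:, :-1], X_red[:, 1:]
    # Solve min ||A_hat X0 - X1||_F through the transposed system.
    A_hat_T, *_ = np.linalg.lstsq(X0.T, X1.T, rcond=None)
    return V, A_hat_T.T
```

Once `A_hat` is learned, stepping the reduced state costs only an `r x r` matrix-vector product, which is what makes such models cheap enough for repeated evaluation in design exploration or uncertainty quantification.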

Authors:Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, Sören Schwertfeger, Fuchun Sun, Chenjia Bai, Xuelong Li
Title: Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning
Abstract:
Humans exhibit diverse and expressive whole-body movements. However, attaining human-like whole-body coordination in humanoid robots remains challenging, as conventional approaches that mimic whole-body motions often neglect the distinct roles of upper and lower body. This oversight leads to computationally intensive policy learning and frequently causes robot instability and falls during real-world execution. To address these issues, we propose Adversarial Locomotion and Motion Imitation (ALMI), a novel framework that enables adversarial policy learning between upper and lower body. Specifically, the lower body aims to provide robust locomotion capabilities to follow velocity commands while the upper body tracks various motions. Conversely, the upper-body policy ensures effective motion tracking when the robot executes velocity-based movements. Through iterative updates, these policies achieve coordinated whole-body control, which can be extended to loco-manipulation tasks with teleoperation systems. Extensive experiments demonstrate that our method achieves robust locomotion and precise motion tracking in both simulation and on the full-size Unitree H1 robot. Additionally, we release a large-scale whole-body motion control dataset featuring high-quality episodic trajectories from MuJoCo simulations deployable on real robots. The project page is https://almi-humanoid.github.io.
中文摘要:ALMI框架通过上下半身的对抗性策略学习,实现了人形机器人稳健的运动能力和精确的动作跟踪,有效解决了传统方法中机器人不稳定和计算复杂的问题。
English Summary: The ALMI framework introduces adversarial policy learning between the upper and lower body to achieve robust locomotion and precise motion tracking in humanoid robots, overcoming traditional challenges of instability and computational intensity.

Authors:Ze Zhao, Bin Lu, Xiaoying Gan, Gu Tang, Luoyi Fu, Xinbing Wang
Title: CHAINSFORMER: Numerical Reasoning on Knowledge Graphs from a Chain Perspective
Abstract:
Reasoning over Knowledge Graphs (KGs) plays a pivotal role in knowledge graph completion or question answering systems, providing richer and more accurate triples and attributes. As numerical attributes become increasingly essential in characterizing entities and relations in KGs, the ability to reason over these attributes has gained significant importance. Existing graph-based methods, such as Graph Neural Networks (GNNs) and Knowledge Graph Embeddings (KGEs), primarily focus on aggregating homogeneous local neighbors and implicitly embedding diverse triples. However, these approaches often fail to fully leverage the potential of logical paths within the graph, limiting their effectiveness in exploiting the reasoning process. To address these limitations, we propose ChainsFormer, a novel chain-based framework designed to support numerical reasoning. ChainsFormer not only explicitly constructs logical chains but also expands the reasoning depth to multiple hops. Specifically, we introduce Relation-Attribute Chains (RA-Chains), a specialized logic chain, to model sequential reasoning patterns. ChainsFormer captures the step-by-step nature of multi-hop reasoning along RA-Chains by employing sequential in-context learning. To mitigate the impact of noisy chains, we propose a hyperbolic affinity scoring mechanism that selects relevant logic chains in a variable-resolution space. Furthermore, ChainsFormer incorporates an attention-based numerical reasoner to identify critical reasoning paths, enhancing both reasoning accuracy and transparency. Experimental results demonstrate that ChainsFormer significantly outperforms state-of-the-art methods, achieving up to a 20.0% improvement in performance. The implementations are available at https://github.com/zhaodazhuang2333/ChainsFormer.
Chinese: ChainsFormer是一种新颖的基于链的框架,通过显式构建逻辑链、扩展推理深度,并采用机制减少噪声和识别关键路径,显著提升了知识图谱上的数值推理能力,性能比现有最优方法提高了高达20.0%。
English: ChainsFormer is a novel chain-based framework that enhances numerical reasoning over Knowledge Graphs by explicitly constructing logical chains, expanding reasoning depth, and employing mechanisms to mitigate noise and identify critical paths, achieving up to a 20.0% performance improvement over state-of-the-art methods.

Authors:Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, Irwin King
Title: CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey
Abstract:
As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for enhancing model robustness across diverse environments. Contrastive Language-Image Pretraining (CLIP) plays a significant role in these tasks, offering powerful zero-shot capabilities that allow models to perform effectively in unseen domains. However, there remains a significant gap in the literature, as no comprehensive survey currently exists that systematically explores the applications of CLIP in DG and DA, highlighting the necessity for this review. This survey presents a comprehensive review of CLIP's applications in DG and DA. In DG, we categorize methods into optimizing prompt learning for task alignment and leveraging CLIP as a backbone for effective feature extraction, both enhancing model adaptability. For DA, we examine both source-available methods utilizing labeled source data and source-free approaches primarily based on target domain data, emphasizing knowledge transfer mechanisms and strategies for improved performance across diverse contexts. Key challenges, including overfitting, domain diversity, and computational efficiency, are addressed, alongside future research opportunities to advance robustness and efficiency in practical applications. By synthesizing existing literature and pinpointing critical gaps, this survey provides valuable insights for researchers and practitioners, proposing directions for effectively leveraging CLIP to enhance methodologies in domain generalization and adaptation. Ultimately, this work aims to foster innovation and collaboration in the quest for more resilient machine learning models that can perform reliably across diverse real-world scenarios. A more up-to-date version of the papers is maintained at: https://github.com/jindongli-Ai/Survey_on_CLIP-Powered_Domain_Generalization_and_Adaptation.
中文: 本综述系统梳理了CLIP在领域泛化与自适应中的应用方法,通过分类讨论优化策略并应对过拟合、领域差异等关键挑战,旨在提升模型在多样化场景中的鲁棒性能。
English: This survey comprehensively reviews CLIP's applications in domain generalization and adaptation, categorizing methods and addressing challenges like overfitting and domain diversity to enhance model robustness across diverse environments.

Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Title: Cross-attention for State-based model RWKV-7
Abstract:
We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code at https://github.com/TorchRWKV/flash-linear-attention
Chinese: CrossWKV是RWKV-7模型中的一种新型交叉注意力机制,通过单次处理整合文本和图像模态,以线性复杂度实现卓越的跨模态对齐,在文本到图像生成中达到领先性能。
English: CrossWKV is a novel cross-attention mechanism for the RWKV-7 model that enhances text-to-image generation by integrating text and image modalities in a single pass, achieving state-of-the-art performance with superior cross-modal alignment and linear complexity.

Authors:Jie Wang, Nana Yu, Zihao Zhang, Yahong Han
Title: Visual Consensus Prompting for Co-Salient Object Detection
Abstract:
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task that addresses these two limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby arousing the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving the new state of the art (with 6.8% improvement in F_m metrics on the most challenging CoCA dataset). Source code is available at https://github.com/WJ-CV/VCP.
Chinese: 现有协同显著目标检测方法存在共识提取滞后和参数效率低下的问题,因此本文提出了一种视觉共识提示(VCP)架构,通过参数高效的提示调优将共识嵌入提示,在冻结基础模型的同时显著提升了检测性能。
English: Current CoSOD methods face inefficiencies due to delayed consensus guidance and parameter-heavy fine-tuning, leading to the proposal of a novel Visual Consensus Prompts (VCP) architecture that integrates parameter-efficient prompt tuning to enhance performance with minimal updates.

Authors:Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Abstract:
Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.
中文: 该摘要提出利用多模态大语言模型进行可解释的虚假图像检测,通过设计六种专用提示词构建新型框架,相比传统方法在鲁棒性和可解释性方面实现显著提升。
English: This abstract proposes using Multi-modal Large Language Models for explainable fake image detection, developing a framework with six specialized prompts to enhance robustness and transparency over traditional methods.

Authors:Yimeng Bai, Shunyu Zhang, Yang Zhang, Hu Liu, Wentian Bao, Enyun Yu, Fuli Feng, Wenwu Ou
Title: Unconstrained Monotonic Calibration of Predictions in Deep Ranking Systems
Abstract:
Ranking models primarily focus on modeling the relative order of predictions while often neglecting the significance of the accuracy of their absolute values. However, accurate absolute values are essential for certain downstream tasks, necessitating the calibration of the original predictions. To address this, existing calibration approaches typically employ predefined transformation functions with order-preserving properties to adjust the original predictions. Unfortunately, these functions often adhere to fixed forms, such as piece-wise linear functions, which exhibit limited expressiveness and flexibility, thereby constraining their effectiveness in complex calibration scenarios. To mitigate this issue, we propose implementing a calibrator using an Unconstrained Monotonic Neural Network (UMNN), which can learn arbitrary monotonic functions with great modeling power. This approach significantly relaxes the constraints on the calibrator, improving its flexibility and expressiveness while avoiding excessively distorting the original predictions by requiring monotonicity. Furthermore, to optimize this highly flexible network for calibration, we introduce a novel additional loss function termed Smooth Calibration Loss (SCLoss), which aims to fulfill a necessary condition for achieving the ideal calibration state. Extensive offline experiments confirm the effectiveness of our method in achieving superior calibration performance. Moreover, deployment in Kuaishou's large-scale online video ranking system demonstrates that the method's calibration improvements translate into enhanced business metrics. The source code is available at https://github.com/baiyimeng/UMC.
Chinese: 本文提出了一种采用无约束单调神经网络和光滑校准损失的方法,通过学习灵活的单调函数来优化预测校准,显著提升了排序系统的离线精度和在线业务指标。
English: The paper introduces an Unconstrained Monotonic Neural Network (UMNN) with a Smooth Calibration Loss to enhance prediction calibration by learning flexible monotonic functions, improving both offline accuracy and online business performance in ranking systems.
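
The monotonicity guarantee of a UMNN comes from expressing the function as the integral of a strictly positive integrand. The sketch below shows that construction with a fixed, hand-set integrand and trapezoid integration in NumPy; in a real UMNN the integrand would be a trained neural network. The parameter values, the name `monotone_calibrator`, and the integration scheme are assumptions for illustration, and the Smooth Calibration Loss is not covered.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z)); always strictly positive."""
    return np.logaddexp(0.0, z)

def monotone_calibrator(x, w=3.0, b=-1.5, bias=0.0, steps=200):
    """F(x) = bias + integral from 0 to x of softplus(w * t + b) dt.

    Because the integrand is positive, F is increasing in x for any
    values of w and b, so no constraint on the parameters is needed.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # One integration grid from 0 to x for every input value.
    t = np.linspace(0.0, 1.0, steps)[None, :] * x[:, None]
    f = softplus(w * t + b)
    dt = x / (steps - 1)
    # Trapezoid rule along each grid.
    integral = np.sum((f[:, 1:] + f[:, :-1]) / 2.0, axis=1) * dt
    return bias + integral
```

Applied to raw ranking scores in `[0, 1]`, such a function rescales the values while preserving their order, which is the property the abstract relies on to avoid distorting the original ranking.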

Authors:Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu
Title: InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Abstract:
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
中文: 通过Actor2Reasoner框架开发的InfiGUI-R1代理,采用两阶段训练方法将GUI代理从反应型执行者转变为审慎型推理者,通过增强推理和错误恢复能力,在复杂GUI任务中实现更优性能。
English: The InfiGUI-R1 agent, developed through the Actor2Reasoner framework, transitions GUI agents from reactive actors to deliberative reasoners using a two-stage training approach that enhances reasoning and error recovery for improved performance in complex GUI tasks.

Authors:Yong-En Tian, Yu-Chien Tang, Kuang-Da Wang, An-Zi Yen, Wen-Chih Peng
Title: Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval
Abstract:
Tailoring structured financial reports from companies' earnings releases is crucial for understanding financial performance and has been widely adopted in real-world analytics. However, existing summarization methods often generate broad, high-level summaries, which may lack the precision and detail required for financial reports that typically focus on specific, structured sections. While Large Language Models (LLMs) hold promise, generating reports adhering to predefined multi-section templates remains challenging. This paper investigates two LLM-based approaches popular in industry for generating templated financial reports: an agentic information retrieval (IR) framework and a decomposed IR approach, namely AgenticIR and DecomposedIR. AgenticIR utilizes collaborative agents prompted with the full template. In contrast, DecomposedIR applies a prompt chaining workflow to break down the template and reframe each section as a query answered by the LLM using the earnings release. To quantitatively assess the generated reports, we evaluated both methods in two scenarios: one using a financial dataset without direct human references, and another with a weather-domain dataset featuring expert-written reports. Experimental results show that while AgenticIR may excel in orchestrating tasks and generating concise reports through agent collaboration, DecomposedIR statistically significantly outperforms the AgenticIR approach in providing broader and more detailed coverage in both scenarios, offering reflection on the utilization of the agentic framework in real-world applications.
中文: 本研究比较了两种基于大语言模型的财务报告生成方法,发现虽然AgenticIR通过智能体协作能生成简洁摘要,但DecomposedIR在金融和气象领域数据集中均能提供更广泛、更详细的覆盖内容,表现显著更优。
English: This study compares two LLM-based methods for generating structured financial reports, finding that while AgenticIR produces concise summaries through agent collaboration, DecomposedIR significantly outperforms in providing broader and more detailed coverage across both financial and weather-domain datasets.
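
The per-section querying described for DecomposedIR can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `decomposed_report`, the prompt wording, and the stand-in model `echo_llm` are all hypothetical, and each section is queried independently here, which may differ from the chaining used in the paper.

```python
from typing import Callable, Dict, List


def decomposed_report(
    template_sections: List[str],
    source_text: str,
    ask_llm: Callable[[str], str],
) -> Dict[str, str]:
    """Answer one query per template section and collect the results."""
    report = {}
    for section in template_sections:
        # Reframe the section heading as a request grounded in the source text.
        prompt = (
            f"Using only the earnings release below, write the "
            f"'{section}' section of the report.\n\n{source_text}"
        )
        report[section] = ask_llm(prompt)
    return report


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: returns the prompt's first line."""
    return prompt.splitlines()[0]


sections = ["Revenue", "Outlook"]
report = decomposed_report(sections, "Q1 revenue rose.", echo_llm)
```

Passing the model as a callable keeps the sketch independent of any particular LLM client; a real client would replace `echo_llm`.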

Authors:Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang
Title: Understanding the Repeat Curse in Large Language Models from a Feature Perspective
Abstract:
Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the "Repeat Curse". While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, "Duplicatus Charm", to induce and analyze the Repeat Curse. Our method systematically identifies "Repetition Features", the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse. The source code of our work is publicly available at: https://github.com/kaustpradalab/repeat-curse-llm
中文: 本研究通过机制可解释性探究大语言模型生成重复文本的根本原因,提出"Duplicatus Charm"方法识别并停用重复特征,有效缓解"重复诅咒"问题,同时公开了源代码。
English: This study investigates the root causes of repetitive text generation in large language models through mechanistic interpretability, proposing the "Duplicatus Charm" method to identify and deactivate repetition features, effectively mitigating the "Repeat Curse" while making the source code publicly available.

Authors:Pierre-Alain Fayolle, Evgenii Maltsev
Title: PyFRep: Shape Modeling with Differentiable Function Representation
Abstract:
We propose a framework for performing differentiable geometric modeling based on the Function Representation (FRep). The framework is built on top of modern libraries for automatic differentiation, allowing us to obtain derivatives w.r.t. space or shape parameters. We demonstrate possible applications of this framework: curvature estimation for shape interrogation, signed distance function computation and approximation, and fitting shape parameters of a parametric model to data. Our framework is released as open source.
中文: 我们提出了一个基于FRep的可微分几何建模框架,利用自动微分计算形状参数导数,应用于曲率估计、SDF计算和参数拟合,并已开源发布。
English: We introduce a differentiable geometric modeling framework using FRep that leverages automatic differentiation to compute derivatives for shape parameters, with applications in curvature estimation, SDF computation, and parametric fitting, released as open-source.
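
To make the idea of a function representation with spatial derivatives concrete, here is a small sketch. It does not use PyFRep's API; every name in it is hypothetical. It assumes the common FRep convention that a solid is the set of points where the defining function is non-negative, and it uses central finite differences as a plain stand-in for the automatic differentiation the abstract refers to.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float, float]
Field = Callable[[Point], float]


def sphere(center: Point, radius: float) -> Field:
    """Defining function: positive inside the sphere, zero on its surface."""
    def f(p: Point) -> float:
        return radius - math.dist(p, center)
    return f


def union(f: Field, g: Field) -> Field:
    """Set-theoretic union via max (min would give the intersection)."""
    return lambda p: max(f(p), g(p))


def gradient(f: Field, p: Point, h: float = 1e-5) -> Point:
    """Central finite differences; a stand-in for automatic differentiation."""
    grads = []
    for i in range(3):
        forward = list(p)
        backward = list(p)
        forward[i] += h
        backward[i] -= h
        grads.append((f(tuple(forward)) - f(tuple(backward))) / (2 * h))
    return (grads[0], grads[1], grads[2])


# Two overlapping unit spheres combined into one shape.
shape = union(sphere((0.0, 0.0, 0.0), 1.0), sphere((1.5, 0.0, 0.0), 1.0))
# Gradient at a surface point; for this field it points into the solid.
g = gradient(shape, (-1.0, 0.0, 0.0))
```

With an automatic differentiation library, `gradient` would be replaced by the library's derivative operator, and the same pattern would extend to derivatives with respect to shape parameters such as `radius`.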

Authors:Hongji Li, Hanwen Du, Youhua Li, Junchen Fu, Chunxiao Li, Ziyi Zhuang, Jiakang Li, Yongxin Ni
Title: Teach Me How to Denoise: A Universal Framework for Denoising Multi-modal Recommender Systems via Guided Calibration
Abstract:
The surge in multimedia content has led to the development of Multi-Modal Recommender Systems (MMRecs), which use diverse modalities such as text, images, videos, and audio for more personalized recommendations. However, MMRecs struggle with noisy data caused by misalignment among modal content and the gap between modal semantics and recommendation semantics. Traditional denoising methods are inadequate due to the complexity of multi-modal data. To address this, we propose a universal guided in-sync distillation denoising framework for multi-modal recommendation (GUIDER), designed to improve MMRecs by denoising user feedback. Specifically, GUIDER uses a re-calibration strategy to identify clean and noisy interactions from modal content. It incorporates a Denoising Bayesian Personalized Ranking (DBPR) loss function to handle implicit user feedback. Finally, it applies a denoising knowledge distillation objective based on Optimal Transport distance to guide the alignment from modality representations to recommendation semantics. GUIDER can be seamlessly integrated into existing MMRecs methods as a plug-and-play solution. Experimental results on four public datasets demonstrate its effectiveness and generalizability. Our source code is available at https://github.com/Neon-Jing/Guider
中文: 提出的GUIDER框架通过交互重校准和知识蒸馏对用户反馈进行去噪,增强了多模态推荐系统,作为一种即插即用的解决方案,有效提升了不同数据集上的推荐准确性。
English: The proposed GUIDER framework enhances Multi-Modal Recommender Systems by denoising user feedback through interaction re-calibration and knowledge distillation, offering a plug-and-play solution that improves recommendation accuracy across diverse datasets.

Authors:Mingzhe Han, Dongsheng Li, Jiafeng Xia, Jiahao Liu, Hansu Gu, Peng Zhang, Ning Gu, Tun Lu
Title: FedCIA: Federated Collaborative Information Aggregation for Privacy-Preserving Recommendation
Abstract:
Recommendation algorithms rely on user historical interactions to deliver personalized suggestions, which raises significant privacy concerns. Federated recommendation algorithms tackle this issue by combining local model training with server-side model aggregation, where most existing algorithms use a uniform weighted summation to aggregate item embeddings from different client models. This approach has three major limitations: 1) information loss during aggregation, 2) failure to retain personalized local features, and 3) incompatibility with parameter-free recommendation algorithms. To address these limitations, we first review the development of recommendation algorithms and recognize that their core function is to share collaborative information, specifically the global relationship between users and items. With this understanding, we propose a novel aggregation paradigm named collaborative information aggregation, which focuses on sharing collaborative information rather than item parameters. Based on this new paradigm, we introduce the federated collaborative information aggregation (FedCIA) method for privacy-preserving recommendation. This method requires each client to upload item similarity matrices for aggregation, which allows clients to align their local models without constraining embeddings to a unified vector space. As a result, it mitigates information loss caused by direct summation, preserves the personalized embedding distributions of individual clients, and supports the aggregation of parameter-free models. Theoretical analysis and experimental results on real-world datasets demonstrate the superior performance of FedCIA compared with the state-of-the-art federated recommendation algorithms. Code is available at https://github.com/Mingzhe-Han/FedCIA.
中文摘要:本文提出FedCIA方法,通过聚合项目相似度矩阵而非嵌入向量,在保护隐私的同时保留个性化特征并支持无参数模型,解决了传统联邦推荐算法的局限性。
English Summary: This paper introduces FedCIA, a novel federated recommendation method that aggregates item similarity matrices instead of embeddings to preserve privacy while maintaining personalized features and supporting parameter-free models.
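
The aggregation idea, sharing item similarity matrices rather than item embeddings, can be illustrated with a short sketch. The abstract does not say how similarities are computed or combined, so cosine similarity and an element-wise mean are assumptions made here for illustration, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def item_similarity(item_embeddings: Matrix) -> Matrix:
    """Client side: cosine similarity between every pair of item embeddings."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in item_embeddings]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (nu * nv)
            for v, nv in zip(item_embeddings, norms)
        ]
        for u, nu in zip(item_embeddings, norms)
    ]


def aggregate(similarities: List[Matrix]) -> Matrix:
    """Server side: element-wise mean of the clients' similarity matrices."""
    n = len(similarities[0])
    k = len(similarities)
    return [
        [sum(s[i][j] for s in similarities) / k for j in range(n)]
        for i in range(n)
    ]


# Two clients embed the same two items in different spaces (2-D and 3-D).
client_a = [[1.0, 0.0], [0.0, 1.0]]
client_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
global_sim = aggregate([item_similarity(client_a), item_similarity(client_b)])
```

Because every similarity matrix is items-by-items, the two clients can be combined even though their embedding dimensions differ, which reflects the abstract's point about not forcing embeddings into a unified vector space.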

Authors:Wenxin Zhang, Cuicui Luo
Title: Decomposition-based multi-scale transformer framework for time series anomaly detection
Abstract:
Time series anomaly detection is crucial for maintaining stable systems. Existing methods face two main challenges. First, it is difficult to directly model the dependencies of diverse and complex patterns within the sequences. Second, many methods that optimize parameters using mean squared error struggle with noise in the time series, leading to performance deterioration. To address these challenges, we propose a transformer-based framework built on decomposition (TransDe) for multivariate time series anomaly detection. The key idea is to combine the strengths of time series decomposition and transformers to effectively learn the complex patterns in normal time series data. A multi-scale patch-based transformer architecture is proposed to exploit the representative dependencies of each decomposed component of the time series. Furthermore, a contrastive learning paradigm based on patch operation is proposed, which leverages KL divergence to align the positive pairs, namely the pure representations of normal patterns between different patch-level views. A novel asynchronous loss function with a stop-gradient strategy is further introduced to enhance the performance of TransDe effectively. It can avoid time-consuming and labor-intensive computation costs in the optimization process. Extensive experiments on five public datasets are conducted and TransDe shows superiority compared with twelve baselines in terms of F1 score. Our code is available at https://github.com/shaieesss/TransDe.
中文: 提出的TransDe框架结合时间序列分解与Transformer架构,通过多尺度补丁转换器和对比学习范式有效解决复杂模式依赖建模和噪声敏感问题,在多元时间序列异常检测中表现出优越性能。
English: The proposed TransDe framework combines time series decomposition with transformers and contrastive learning to effectively detect anomalies in multivariate time series by addressing pattern dependency modeling and noise sensitivity issues.

Authors:Wenxin Zhang, Jingxing Zhong, Guangzhen Yao, Renda Han, Xiaojian Lin, Zeyu Zhang, Cuicui Luo
Title: Dual-channel Heterophilic Message Passing for Graph Fraud Detection
Abstract:
Fraudulent activities have significantly increased across various domains, such as e-commerce, online review platforms, and social networks, making fraud detection a critical task. Spatial Graph Neural Networks (GNNs) have been successfully applied to fraud detection tasks due to their strong inductive learning capabilities. However, existing spatial GNN-based methods often enhance the graph structure by excluding heterophilic neighbors during message passing to align with the homophilic bias of GNNs. Unfortunately, this approach can disrupt the original graph topology and increase uncertainty in predictions. To address these limitations, this paper proposes a novel framework, Dual-channel Heterophilic Message Passing (DHMP), for fraud detection. DHMP leverages a heterophily separation module to divide the graph into homophilic and heterophilic subgraphs, mitigating the low-pass inductive bias of traditional GNNs. It then applies shared weights to capture signals at different frequencies independently and incorporates a customized sampling strategy for training. This allows nodes to adaptively balance the contributions of various signals based on their labels. Extensive experiments on three real-world datasets demonstrate that DHMP outperforms existing methods, highlighting the importance of separating signals with different frequencies for improved fraud detection. The code is available at https://github.com/shaieesss/DHMP.
中文: 本文提出双通道异质信息传递(DHMP)框架,通过分离图中的同质与异质信号来克服传统空间图神经网络的局限,在真实数据集上实现了更优的欺诈检测性能。
English: This paper introduces the Dual-channel Heterophilic Message Passing (DHMP) framework, which improves fraud detection by separating homophilic and heterophilic signals in graphs to overcome the limitations of traditional spatial GNNs, achieving superior performance on real-world datasets.
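
The separation of a graph into homophilic and heterophilic parts can be sketched on the edge level as below. This is an illustrative simplification, not the DHMP module itself: it assumes every node label is known, whereas in fraud detection many labels are unknown and would first have to be estimated, a step the abstract does not detail. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def split_by_homophily(
    edges: List[Edge], labels: Dict[int, int]
) -> Tuple[List[Edge], List[Edge]]:
    """Split edges into same-label (homophilic) and cross-label (heterophilic)."""
    homophilic: List[Edge] = []
    heterophilic: List[Edge] = []
    for u, v in edges:
        if labels[u] == labels[v]:
            homophilic.append((u, v))
        else:
            heterophilic.append((u, v))
    return homophilic, heterophilic


# Nodes 0 and 1 are benign (label 0); nodes 2 and 3 are fraudulent (label 1).
labels = {0: 0, 1: 0, 2: 1, 3: 1}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
homo_edges, hetero_edges = split_by_homophily(edges, labels)
```

Message passing could then run separately over the two edge sets, so that cross-label neighbors are kept as a distinct signal instead of being dropped.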

Authors:Wenxin Zhang, Xiaojian Lin, Wenjun Yu, Guangzhen Yao, Jingxiang Zhong, Yu Li, Renda Han, Songcheng Xu, Hao Shi, Cuicui Luo
Title: DConAD: A Differencing-based Contrastive Representation Learning Framework for Time Series Anomaly Detection
Abstract:
Time series anomaly detection holds notable importance for risk identification and fault detection across diverse application domains. Unsupervised learning methods have become popular because they have no requirement for labels. However, due to the challenges posed by the multiplicity of abnormal patterns, the sparsity of anomalies, and the growth of data scale and complexity, these methods often fail to capture robust and representative dependencies within the time series for identifying anomalies. To enhance the ability of models to capture normal patterns of time series and avoid the retrogression of modeling ability triggered by the dependencies on high-quality prior knowledge, we propose a differencing-based contrastive representation learning framework for time series anomaly detection (DConAD). Specifically, DConAD generates differential data to provide additional information about time series and utilizes a transformer-based architecture to capture spatiotemporal dependencies, which enhances the robustness of unbiased representation learning ability. Furthermore, DConAD implements a novel KL divergence-based contrastive learning paradigm that only uses positive samples to avoid deviation from reconstruction and deploys the stop-gradient strategy to compel convergence. Extensive experiments on five public datasets show the superiority and effectiveness of DConAD compared with nine baselines. The code is available at https://github.com/shaieesss/DConAD.
中文: 提出的DConAD框架通过生成差分数据来捕捉稳健的时空依赖关系,并采用仅使用正样本的对比学习范式,有效提升了时间序列异常检测性能,在多个数据集上展现出优越性。
English: The proposed DConAD framework enhances time series anomaly detection by generating differential data to capture robust spatiotemporal dependencies and implementing a contrastive learning paradigm using only positive samples, demonstrating superior performance across multiple datasets.
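
The "differential data" mentioned above can be illustrated with ordinary lagged differencing. The abstract does not specify the exact differencing scheme, so first-order differencing with a configurable lag is an assumption here, and the function name is hypothetical.

```python
from typing import List


def difference(series: List[float], lag: int = 1) -> List[float]:
    """Return series[t] - series[t - lag] for every valid time step t."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A mostly smooth signal with one spike at index 3.
readings = [10.0, 10.5, 11.0, 30.0, 11.5]
diffs = difference(readings)
```

In the differenced view the slow upward trend shrinks to small constant steps while the spike shows up as two large jumps of opposite sign, which is the kind of additional information a model could use alongside the raw series.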

Authors:Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Title: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Abstract:
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
中文摘要:Meta-rater作为一种多维度数据筛选方法,通过专业性、可读性、逻辑性和洁净度四个维度评估数据质量,使13亿参数模型的训练速度提升两倍,下游任务性能提高3.23分,并能扩展至72亿参数模型。
English Summary: Meta-rater is a multi-dimensional data selection method that evaluates data quality across professionalism, readability, reasoning, and cleanliness, doubling convergence speed for 1.3B-parameter models and improving downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters.
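
The selection step, combining several quality scores with weights and keeping the best-scoring documents, can be sketched as below. The weights here are made-up constants for illustration; in the paper they are learned through proxy models and a regression on validation loss, which this sketch does not reproduce. Function names and score values are hypothetical.

```python
from typing import Dict, List


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of a document's per-dimension quality scores."""
    return sum(weights[name] * scores[name] for name in weights)


def select_top(
    docs: List[Dict[str, float]], weights: Dict[str, float], k: int
) -> List[int]:
    """Return the indices of the k documents with the highest combined score."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: combined_score(docs[i], weights),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical weights over the four dimensions named in the abstract.
weights = {
    "professionalism": 0.4,
    "readability": 0.1,
    "reasoning": 0.3,
    "cleanliness": 0.2,
}
docs = [
    {"professionalism": 1, "readability": 5, "reasoning": 1, "cleanliness": 2},
    {"professionalism": 4, "readability": 2, "reasoning": 5, "cleanliness": 3},
    {"professionalism": 3, "readability": 3, "reasoning": 2, "cleanliness": 5},
]
chosen = select_top(docs, weights, k=2)
```

The first document scores highest on readability alone but is ranked last once all four dimensions are weighed together, which is the motivation for multi-dimensional selection.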

Authors:Zekai Chen, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Title: Rethinking Client-oriented Federated Graph Learning
Abstract:
As a new distributed graph learning paradigm, Federated Graph Learning (FGL) facilitates collaborative model training across local systems while preserving data privacy. We review existing FGL approaches and categorize their optimization mechanisms into: (1) Server-Client (S-C), where clients upload local model parameters for server-side aggregation and global updates; (2) Client-Client (C-C), which allows clients to exchange information directly and customize their local training process. We reveal that C-C shows superior potential due to its refined communication structure. However, existing C-C methods broadcast redundant node representations, incurring high communication costs and privacy risks at the node level. To this end, we propose FedC4, which combines graph Condensation with C-C Collaboration optimization. Specifically, FedC4 employs a graph condensation technique to refine the knowledge of each client's graph into a few synthetic embeddings instead of transmitting node-level knowledge. Moreover, FedC4 introduces three novel modules that allow the source client to send distinct node representations tailored to the target client's graph properties. Experiments on eight public real-world datasets show that FedC4 outperforms state-of-the-art baselines in both task performance and communication cost. Our code is now available at https://github.com/Ereshkigal1/FedC4.
中文摘要:联邦图学习(FGL)在保护数据隐私的同时实现协同训练,而提出的FedC4方法通过图压缩技术优化客户端间协作,有效降低通信成本与隐私风险,在性能和效率上均优于现有方法。
English Summary: Federated Graph Learning (FGL) enables collaborative training while preserving data privacy, and the proposed FedC4 method enhances Client-Client optimization by using graph condensation to reduce communication costs and privacy risks, outperforming existing methods in both performance and efficiency.

Authors:Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang, Pei Wang, Rizwan Qureshi, Shaina Raza, Aman Chadha, Yong Xiang, Zhixiang Chen
Title: DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
Abstract:
We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
Chinese: DanceText是一种无需训练的多语言图像文本编辑框架,通过分层编辑策略和深度感知模块,在复杂几何变换下实现文本与背景的无缝融合,并展现出卓越的视觉质量。
English: DanceText is a training-free framework for multilingual text editing in images that uses a layered approach and depth-aware module to enable complex geometric transformations while maintaining seamless foreground-background integration and superior visual quality.

Authors:Wei Dong, Han Zhou, Seyed Amirreza Mousavi, Jun Chen
Title: Retinex-guided Histogram Transformer for Mask-free Shadow Removal
Abstract:
While deep learning methods have achieved notable progress in shadow removal, many existing approaches rely on shadow masks that are difficult to obtain, limiting their generalization to real-world scenes. In this work, we propose ReHiT, an efficient mask-free shadow removal framework based on a hybrid CNN-Transformer architecture guided by Retinex theory. We first introduce a dual-branch pipeline to separately model reflectance and illumination components, and each is restored by our developed Illumination-Guided Hybrid CNN-Transformer (IG-HCT) module. Second, besides the CNN-based blocks that are capable of learning residual dense features and performing multi-scale semantic fusion, we develop the Illumination-Guided Histogram Transformer Block (IGHB) to effectively handle non-uniform illumination and spatially complex shadows. Extensive experiments on several benchmark datasets validate the effectiveness of our approach over existing mask-free methods. Trained solely on the NTIRE 2025 Shadow Removal Challenge dataset, our solution delivers competitive results with one of the smallest parameter sizes and fastest inference speeds among top-ranked entries, highlighting its applicability for real-world applications with limited computational resources. The code is available at https://github.com/dongw22/oath.
Chinese: 提出的ReHiT框架基于Retinex理论,采用混合CNN-Transformer架构实现无需阴影掩码的阴影去除方法,在基准数据集上以最小的计算资源取得了具有竞争力的性能。
English: The proposed ReHiT framework introduces a mask-free shadow removal method using a hybrid CNN-Transformer guided by Retinex theory, achieving efficient performance with minimal computational resources and competitive results on benchmark datasets.

Authors:Wei Dong, Yan Min, Han Zhou, Jun Chen
Title: Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design
Abstract:
Current Low-light Image Enhancement (LLIE) techniques predominantly rely on either direct Low-Light (LL) to Normal-Light (NL) mappings or guidance from semantic features or illumination maps. Nonetheless, the intrinsic ill-posedness of LLIE and the difficulty in retrieving robust semantics from heavily corrupted images hinder their effectiveness in extremely low-light environments. To tackle this challenge, we present SG-LLIE, a new multi-scale CNN-Transformer hybrid framework guided by structure priors. Different from employing pre-trained models for the extraction of semantics or illumination maps, we choose to extract robust structure priors based on illumination-invariant edge detectors. Moreover, we develop a CNN-Transformer Hybrid Structure-Guided Feature Extractor (HSGFE) module at each scale within the UNet encoder-decoder architecture. Besides the CNN blocks, which excel in multi-scale feature extraction and fusion, we introduce a Structure-Guided Transformer Block (SGTB) in each HSGFE that incorporates structural priors to modulate the enhancement process. Extensive experiments show that our method achieves state-of-the-art performance on several LLIE benchmarks in both quantitative metrics and visual quality. Our solution ranks second in the NTIRE 2025 Low-Light Enhancement Challenge. Code is released at https://github.com/minyan8/imagine.
Chinese: SG-LLIE提出了一种多尺度CNN-Transformer混合框架,通过基于光照不变边缘检测器提取稳健结构先验来突破现有低光照图像增强方法的局限,在多个基准测试中取得最优性能,并荣获NTIRE 2025低光照增强挑战赛第二名。
English: SG-LLIE introduces a multi-scale CNN-Transformer hybrid framework that leverages robust structure priors from illumination-invariant edge detection to overcome the limitations of existing low-light image enhancement methods, achieving state-of-the-art results on benchmarks and ranking second in the NTIRE 2025 challenge.

Authors:Leo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, Krishnamurthy Dvijotham
Title: DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Abstract:
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and $τ$-bench (for tool calling agents); 2) It is configurable and allows for detailed threat modeling, allowing configuration of specific components of the agentic framework being attackable, and specifying targets for the attacker; and 3) It is modular and decouples the development of attacks from details of the environment in which the agent is deployed, allowing for the same attacks to be applied across multiple environments. We illustrate several advantages of our framework, including the ability to adapt to new threat models and environments easily, the ability to easily combine several previously published attacks to enable comprehensive and fine-grained security testing, and the ability to analyze trade-offs between various vulnerabilities and performance. We apply DoomArena to state-of-the-art (SOTA) web and tool-calling agents and find a number of surprising results: 1) SOTA agents have varying levels of vulnerability to different threat models (malicious user vs malicious environment), and there is no Pareto dominant agent across all threat models; 2) When multiple attacks are applied to an agent, they often combine constructively; 3) Guardrail model-based defenses seem to fail, while defenses based on powerful SOTA LLMs work better. DoomArena is available at https://github.com/ServiceNow/DoomArena.
中文: DoomArena 是一个插件化、可配置且模块化的 AI 智能体安全评估框架,支持细粒度威胁建模与跨环境攻击测试,研究发现现有先进智能体存在多种漏洞且传统护栏防御效果有限。
English: DoomArena is a plug-in, configurable, and modular security evaluation framework for AI agents that enables detailed threat modeling and cross-environment attack development, revealing vulnerabilities in state-of-the-art agents and ineffective guardrail defenses.

Authors:Kai Chen, Xiaochen Li, Chen Gong, Ryan McKenna, Tianhao Wang
Title: Benchmarking Differentially Private Tabular Data Synthesis
Abstract:
Differentially private (DP) tabular data synthesis generates artificial data that preserves the statistical properties of private data while safeguarding individual privacy. The emergence of diverse algorithms in recent years has introduced challenges in practical applications, such as inconsistent data processing methods, lack of in-depth algorithm analysis, and incomplete comparisons due to overlapping development timelines. These factors create significant obstacles to selecting appropriate algorithms. In this paper, we address these challenges by proposing a benchmark for evaluating tabular data synthesis methods. We present a unified evaluation framework that integrates data preprocessing, feature selection, and synthesis modules, facilitating fair and comprehensive comparisons. Our evaluation reveals that a significant utility-efficiency trade-off exists among current state-of-the-art methods. Some statistical methods are superior in synthesis utility, but their efficiency is not as good as most machine learning-based methods. Furthermore, we conduct an in-depth analysis of each module with experimental validation, offering theoretical insights into the strengths and limitations of different strategies.
中文: 本文提出了一个评估差分隐私表格数据合成方法的基准,通过统一框架揭示了统计方法与机器学习方法在效用和效率上的权衡,并借助实验验证提供了不同策略优缺点的理论分析。
English: This paper introduces a benchmark for evaluating differentially private tabular data synthesis methods, proposing a unified framework that reveals a utility-efficiency trade-off between statistical and machine learning approaches while providing theoretical insights through experimental validation.

Authors:Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
Title: LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Abstract:
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
中文: 本研究提出了一种新颖的特征上采样方法,通过基于坐标的交叉注意力变换器和自蒸馏伪地面实况特征来提高视觉基础模型的分辨率,在像素级任务中显著优于现有技术。
English: This study introduces a novel feature upsampling method using a coordinate-based cross-attention transformer and self-distilled pseudo-groundtruth features to enhance the resolution of vision foundation models, significantly outperforming existing techniques in pixel-level tasks.

Authors:Mehmet Yamaç, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj
Title: Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing
Abstract:
Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at https://github.com/mehmetyamac/MTSNet.
中文摘要:本文提出的多尺度张量求和分解作为一种新型神经网络层,在减少参数的同时,在多种计算机视觉任务中展现出优于传统全连接层和卷积网络的性能。
English summary: This paper introduces Multiscale Tensor Summation (MTS) Factorization, a novel neural network layer that reduces parameters while outperforming traditional MLPs and CNNs across various computer vision tasks.

Authors:Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng
Title: CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee
Abstract:
Accurate short-term traffic demand prediction is critical for the operation of traffic systems. Besides point estimation, the confidence interval of the prediction is also of great importance. Many models for traffic operations, such as shared bike rebalancing and taxi dispatching, take into account the uncertainty of future demand and require confidence intervals as the input. However, existing methods for confidence interval modeling rely on strict assumptions, such as unchanging traffic patterns and correct model specifications, to guarantee sufficient coverage. Therefore, the confidence intervals provided could be invalid, especially in a changing traffic environment. To fill this gap, we propose an efficient method, CONTINA (Conformal Traffic Intervals with Adaptation), to provide interval predictions that can adapt to external changes. By collecting interval errors during deployment, the method can adjust the interval in the next step by widening it if the errors are too large or shortening it otherwise. Furthermore, we theoretically prove that the coverage of the confidence intervals provided by our method converges to the target coverage level. Experiments across four real-world datasets and prediction models demonstrate that the proposed method can provide valid confidence intervals with shorter lengths. Our method can help traffic management personnel develop a more reasonable and robust operation plan in practice. We release the code, model, and dataset on GitHub: https://github.com/xiannanhuang/CONTINA/.
中文: 所提出的CONTINA方法能够自适应调整交通需求预测区间,在动态环境中确保置信度覆盖的有效性,通过真实数据集验证表明其能以更短的区间长度提供更可靠的预测结果。
English: The proposed method CONTINA adaptively adjusts traffic demand prediction intervals to ensure valid confidence coverage in changing environments, outperforming existing approaches by providing shorter yet more reliable intervals across real-world datasets.
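
The widen-on-miss, shrink-on-hit idea described in the abstract can be illustrated with a generic adaptive conformal update. This is a minimal sketch, not the CONTINA implementation: the function names, the step size `gamma`, and the symmetric residual-quantile interval are all assumptions made for illustration.

```python
def adaptive_interval_step(alpha_t, covered, target_alpha=0.1, gamma=0.05):
    """One online update of the working miscoverage level (hypothetical helper).

    covered: True if the last observation fell inside the last interval.
    A miss lowers alpha_t (wider next interval); a hit raises it (narrower).
    """
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def interval_from_residuals(point_pred, abs_residuals, alpha_t):
    """Symmetric interval from an empirical quantile of past absolute errors."""
    # Clamp the quantile level so extreme alpha_t values stay valid.
    level = min(max(1.0 - alpha_t, 0.0), 1.0)
    ordered = sorted(abs_residuals)
    idx = min(int(level * len(ordered)), len(ordered) - 1)
    half_width = ordered[idx]
    return point_pred - half_width, point_pred + half_width


# Usage: a miss makes the next interval at least as wide as after a hit.
past_errors = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
alpha_after_miss = adaptive_interval_step(0.1, covered=False)
alpha_after_hit = adaptive_interval_step(0.1, covered=True)
wide = interval_from_residuals(100.0, past_errors, alpha_after_miss)
narrow = interval_from_residuals(100.0, past_errors, alpha_after_hit)
```

In this sketch the long-run miss rate is steered toward `target_alpha` because every miss pushes the working level down and every hit pushes it up; the coverage guarantee stated in the abstract belongs to the authors' method, not to this simplified version.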

Authors:Zhongxi Qiu, Zhang Zhang, Yan Hu, Heng Li, Jiang Liu
Title: Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain
Abstract:
This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. The code is available in our repository: https://github.com/Qsingle/open-medical-r1.
中文摘要:本研究探讨了医疗领域中验证奖励强化学习的最佳数据选择策略,发现经过筛选的训练数据通常能提升模型性能,而自筛选样本在医疗领域表现优异但会降低跨领域鲁棒性。
English Summary: This study examines optimal data selection methods for Reinforcement Learning with Verified Rewards in medical applications, finding that filtered training data generally enhances performance while self-filtered samples excel in medical domains but reduce cross-domain robustness.

Authors:Zhanglin Wu, Tengfei Song, Ning Xie, Mengli Zhu, Weidong Zhang, Shuang Wu, Pengfei Li, Chong Li, Junhao Zhu, Hao Yang, Shiliang Sun
Title: Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Abstract:
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.
中文: MOTBench是一个专门评估大型视觉语言模型的框架,用于测试其准确识别和翻译具有复杂布局及文化特定元素的菜单的能力,为未来发展提供指导。
English: MOTBench is a specialized evaluation framework for large vision-language models that assesses their ability to accurately recognize and translate complex menu layouts with culturally specific elements, providing insights for future development.

Authors:Kunihiko Fujiwara, Ryuta Tsurumi, Tomoki Kiyono, Zicheng Fan, Xiucheng Liang, Binyu Lei, Winston Yap, Koichi Ito, Filip Biljecki
Title: VoxCity: A Seamless Framework for Open Geospatial Data Integration, Grid-Based Semantic 3D City Model Generation, and Urban Environment Simulation
Abstract:
Three-dimensional urban environment simulation is a powerful tool for informed urban planning. However, the intensive manual effort required to prepare input 3D city models has hindered its widespread adoption. To address this challenge, we present VoxCity, an open-source Python package that provides a one-stop solution for grid-based 3D city model generation and urban environment simulation for cities worldwide. VoxCity's 'generator' subpackage automatically downloads building heights, tree canopy heights, land cover, and terrain elevation within a specified target area, and voxelizes buildings, trees, land cover, and terrain to generate an integrated voxel city model. The 'simulator' subpackage enables users to conduct environmental simulations, including solar radiation and view index analyses. Users can export the generated models using several file formats compatible with external software, such as ENVI-met (INX), Blender, and Rhino (OBJ). We generated 3D city models for eight global cities, and demonstrated the calculation of solar irradiance, sky view index, and green view index. We also showcased microclimate simulation and 3D rendering visualization through ENVI-met and Rhino, respectively, through the file export function. Additionally, we reviewed openly available geospatial data to create guidelines to help users choose appropriate data sources depending on their target areas and purposes. VoxCity can significantly reduce the effort and time required for 3D city model preparation and promote the utilization of urban environment simulations. This contributes to more informed urban and architectural design that considers environmental impacts, and in turn, fosters sustainable and livable cities. VoxCity is released openly at https://github.com/kunifujiwara/VoxCity.
中文: VoxCity是一个开源Python工具包,能自动生成三维体素城市模型并进行环境模拟,如太阳辐射分析,大幅降低了城市规划中的人工成本。
English: VoxCity is an open-source Python package that automates the generation of 3D voxel city models and enables environmental simulations like solar radiation analysis, significantly reducing manual effort in urban planning.

Authors:Deyu Cao, Samin Aref
Title: Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining
Abstract:
The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods compress model size by replacing their high-precision parameters with quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ's level. First, we look into combining existing quantization-aware training techniques with ApiQ's partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be feasible through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces ApiQ's accuracy degradation by 10.85% and 7.54% respectively. A Python implementation of the proposed quantization method is publicly available on GitHub: https://github.com/TokuyuSou/ULB-SAPR.
中文: 本文提出了一种新型超低位量化方法,通过显著性感知正则化在无需完全重新训练的情况下,将大型语言模型的资源效率提升的同时,相比ApiQ方法减少了超过7%的精度损失。
English: A novel ultra-low-bit quantization method with saliency-aware regularization is proposed to enhance resource efficiency of large language models while reducing accuracy degradation by over 7% compared to ApiQ, without requiring full retraining.

Authors:Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi
Title: Science Hierarchography: Hierarchical Organization of Science Literature
Abstract:
Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction -- from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
中文: 本文提出了科学层级图谱方法,通过结合嵌入聚类与大语言模型提示的混合技术,将科学文献组织成多层次结构,以提升可解释性并为文献探索提供传统检索之外的替代途径。
English: This paper introduces SCIENCE HIERARCHOGRAPHY, a method that organizes scientific literature into a hierarchical structure using a hybrid approach combining embedding-based clustering and LLM-based prompting to improve interpretability and exploration beyond traditional search methods.

Authors:Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
Title: Generative AI Act II: Test Time Scaling Drives Cognition Engineering
Abstract:
The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations such as knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering
中文摘要:第一代大语言模型通过规模扩展取得成功但存在知识滞后等局限,而新兴的第二代模型通过测试时扩展技术转变为思维构建引擎,实现了与AI的思维层面连接。
English Summary: The first generation of large language models achieved success through scaling but faced limitations like knowledge latency, while the emerging second generation transitions to thought-construction engines through test-time scaling, enabling mind-level AI connections.

Authors:Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, Gao Huang
Title: CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
Abstract:
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.
中文摘要:CheXWorld是首个针对放射影像的自监督世界模型,它能同时建模局部解剖结构、全局解剖布局和领域变化这三个关键医学知识维度,在多项医学图像任务中显著优于现有方法。
English Summary: CheXWorld is a self-supervised world model for radiographic images that captures three key aspects of medical knowledge—local anatomical structures, global anatomical layouts, and domain variations—demonstrating superior performance in medical image classification and segmentation tasks compared to existing methods.

Authors:Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
Title: Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Abstract:
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs' perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs' recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
中文摘要:本研究首次揭示了大型语言模型在不同语言中识别知识边界的方式,发现其感知编码于模型中层,并提出无需训练的对齐方法可跨语言转移边界感知能力,有效降低低资源语言的幻觉风险。
English Summary: This study investigates how large language models perceive knowledge boundaries across different languages, revealing that such perceptions are encoded in specific model layers and can be transferred through a training-free alignment method to reduce hallucinations in low-resource languages.

Authors:Zhu Zhu, Shuo Jiang, Jingyuan Zheng, Yawen Li, Yifei Chen, Manli Zhao, Weizhong Gu, Feiwei Qin, Jinhu Wang, Gang Yu
Title: Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis
Abstract:
Neuroblastoma, adrenal-derived, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole-slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent accuracy. Existing automated whole-slide image classification methods encounter challenges such as poor interpretability, limited feature extraction capabilities, and high computational costs, restricting their practical clinical deployment. To overcome these limitations, we propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification, which enhances the Swin Transformer architecture by integrating a Kernel Activation Network within its multilayer perceptron and classification head modules, significantly improving both interpretability and accuracy. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians' comprehensive approach, effectively capturing global and local tissue characteristics. Additionally, we introduce a heuristic soft voting mechanism guided by clinical insights to bridge patch-level predictions to whole-slide image-level classifications seamlessly. We validated CMSwinKAN on the publicly available BreakHis dataset and the PpNTs dataset, which was established by our hospital. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets. Our source code is available at https://github.com/JSLiam94/CMSwinKAN.
中文: CMSwinKAN模型通过多尺度特征融合和对比学习策略,在病理图像分类中实现了比现有方法更高的准确性和可解释性。
English: The proposed CMSwinKAN model enhances pathological image classification by integrating multi-scale feature fusion and contrastive learning, achieving superior accuracy and interpretability compared to existing methods.
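
Generic soft voting, the kind of step the abstract describes for turning patch-level predictions into a slide-level label, can be sketched as a weighted average of patch probabilities. The abstract does not specify the clinical heuristic, so the `patch_weights` argument here is a hypothetical stand-in, and this is not the authors' mechanism.

```python
def soft_vote(patch_probs, patch_weights=None):
    """Aggregate patch-level class probabilities into one slide-level prediction.

    patch_probs: list of per-patch probability vectors (one entry per class).
    patch_weights: optional per-patch weights (hypothetical; uniform if omitted).
    """
    if patch_weights is None:
        patch_weights = [1.0] * len(patch_probs)
    total = sum(patch_weights)
    n_classes = len(patch_probs[0])
    # Weighted mean probability of each class across all patches.
    slide_probs = [
        sum(w * p[c] for w, p in zip(patch_weights, patch_probs)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: slide_probs[c])
    return label, slide_probs


# Usage: one confident patch can outweigh two weakly opposed patches.
patches = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
label_uniform, probs_uniform = soft_vote(patches)
label_weighted, probs_weighted = soft_vote(patches, patch_weights=[1.0, 1.0, 4.0])
```

Unlike a hard majority vote, which would pick class 1 for the three patches above, averaging probabilities keeps each patch's confidence in play; up-weighting the third patch flips the result back to class 1.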

Authors:Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
Title: Learning to Attribute with Attention
Abstract:
Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2.
中文摘要:提出的AT2方法通过将注意力权重作为特征进行学习,实现了语言模型中词元影响的高效归因,其性能与耗时的消融方法相当,同时显著提升了计算效率。
English Summary: The proposed AT2 method efficiently attributes token influence in language models by learning to use attention weights as features, achieving performance comparable to costly ablation-based approaches while significantly improving computational efficiency.
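
The idea of treating per-head attention weights as features and fitting them to ablation signal can be illustrated with a small least-squares fit. This is a simplified sketch under stated assumptions, not the AT2 code: the function names, the plain gradient-descent fit, and the toy data are illustrative, and it assumes measured ablation effects for each source token are already available.

```python
def head_weighted_scores(attn, coeffs):
    """Score each source token as a coefficient-weighted sum over heads.

    attn[h][i]: attention paid to source token i by head h (assumed given).
    """
    n_tokens = len(attn[0])
    return [sum(c * head[i] for c, head in zip(coeffs, attn)) for i in range(n_tokens)]


def fit_head_coeffs(attn, ablation_effects, lr=0.5, steps=500):
    """Fit per-head coefficients to measured ablation effects (least squares)."""
    coeffs = [0.0] * len(attn)
    n = len(ablation_effects)
    for _ in range(steps):
        preds = head_weighted_scores(attn, coeffs)
        residuals = [p - y for p, y in zip(preds, ablation_effects)]
        # Gradient of the mean squared error with respect to each coefficient.
        for h, head in enumerate(attn):
            grad = 2.0 / n * sum(r * a for r, a in zip(residuals, head))
            coeffs[h] -= lr * grad
    return coeffs


# Toy data: the ablation effects are exactly what head 0 attends to,
# so the fit should learn to trust head 0 and ignore head 1.
attn = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
effects = [0.7, 0.2, 0.1]
coeffs = fit_head_coeffs(attn, effects)
scores = head_weighted_scores(attn, coeffs)
```

Once the coefficients are learned, scoring a new sequence needs only its attention weights, which is where the efficiency gain over repeated ablations would come from in a scheme like this.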

Authors:Paul K. Mandal, Cole Leo, Connor Hurley
Title: Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence
Abstract:
Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at https://github.com/PaulKMandal/CONTACT/.
中文摘要:CONTACT框架利用大型语言模型和最少监督从开源情报中预测领土控制情况,研究表明基于提示调优的BLOOMZ模型在低资源环境下优于小样本分类器,并能有效降低标注需求。
English Summary: The CONTACT framework utilizes large language models with minimal supervision to predict territorial control from open-source intelligence, demonstrating that prompt-tuned BLOOMZ outperforms few-shot classifiers in low-resource settings while reducing annotation needs.

Authors:Remko Proesmans, Thomas Lips, Francis wyffels
Title: Self-Mixing Laser Interferometry: In Search of an Ambient Noise-Resilient Alternative to Acoustic Sensing
Abstract:
Self-mixing interferometry (SMI) has been lauded for its sensitivity in detecting microvibrations, while requiring no physical contact with its target. Microvibrations, i.e., sounds, have recently been used as a salient indicator of extrinsic contact in robotic manipulation. In previous work, we presented a robotic fingertip using SMI for extrinsic contact sensing as an ambient-noise-resilient alternative to acoustic sensing. Here, we extend the validation experiments to the frequency domain. We find that for broadband ambient noise, SMI still outperforms acoustic sensing, but the difference is less pronounced than in time-domain analyses. For targeted noise disturbances, analogous to multiple robots simultaneously collecting data for the same task, SMI is still the clear winner. Lastly, we show how motor noise affects SMI sensing more so than acoustic sensing, and that a higher SMI readout frequency is important for future work. Design and data files are available at https://github.com/RemkoPr/icra2025-SMI-tactile-sensing.
中文: 自混合干涉测量法在宽频和针对性噪声环境下检测微振动优于声学传感,但受电机噪声影响更大,因此未来需提高读取频率。
English: Self-mixing interferometry (SMI) outperforms acoustic sensing in detecting microvibrations under broadband and targeted noise, though motor noise affects it more, highlighting the need for higher readout frequencies.

Authors:Mengyuan Li, Changhong Fu, Ziyu Lu, Zijie Zhang, Haobo Zuo, Liangliang Yao
Title: AnyTSR: Any-Scale Thermal Super-Resolution for UAV
Abstract:
Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to this issue, but most existing SR methods are designed for fixed-scale SR, which makes them computationally expensive and inflexible in practical applications. To address these issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature codes, enabling more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.
中文: 本文提出AnyTSR方法,通过创新的图像编码器和任意尺度上采样器提升无人机热成像细节并减少伪影,在所有缩放比例下均优于现有技术。
English: This paper introduces AnyTSR, a novel any-scale thermal super-resolution method for UAVs that enhances image detail and reduces artifacts through an innovative encoder and upsampler, outperforming existing techniques across all scaling factors.

Authors:Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Title: EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model
Abstract:
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles these three challenges with a tailored dataset, benchmark and model. First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, which is thoroughly optimized for fine-grained ophthalmic visual understanding and incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.
中文摘要:医学大型视觉语言模型在医疗领域潜力显著,但其依赖通用数据和粗粒度视觉理解限制了智能眼科诊断的应用;本文提出的Eyecare Kit通过定制数据集、评估基准和优化模型系统解决了这一难题,在多项眼科任务中实现了领先性能。
English Summary: Medical Large Vision-Language Models show promise in healthcare but face limitations in ophthalmic diagnosis due to reliance on general data and coarse visual understanding, which the proposed Eyecare Kit addresses with a tailored dataset, benchmark, and optimized model to achieve state-of-the-art performance.

Authors:Yushen He, Lei Zhao, Tianchen Deng, Zipeng Fang, Weidong Chen
Title: Lightweight LiDAR-Camera 3D Dynamic Object Detection and Multi-Class Trajectory Prediction
Abstract:
Service mobile robots are often required to avoid dynamic objects while performing their tasks, but they usually have only limited computational resources. We therefore present a lightweight multi-modal framework for 3D object detection and trajectory prediction. Our system synergistically integrates LiDAR and camera inputs to achieve real-time perception of pedestrians, vehicles, and riders in 3D space. The framework proposes two novel modules: 1) a Cross-Modal Deformable Transformer (CMDT) for object detection with high accuracy and an acceptable amount of computation, and 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for efficient and diverse trajectory prediction of multi-class objects with flexible trajectory lengths. Evaluations on the CODa benchmark demonstrate superior performance over existing methods across detection (+2.03% in mAP) and trajectory prediction (-0.408m in minADE5 of pedestrians) metrics. Remarkably, the system exhibits exceptional deployability: when implemented on a wheelchair robot with an entry-level NVIDIA 3060 GPU, it achieves real-time inference at 13.2 fps. To facilitate reproducibility and practical deployment, we release the related code of the method at https://github.com/TossherO/3D_Perception and its ROS inference version at https://github.com/TossherO/ros_packages.
Chinese: 本文提出了一种轻量级多模态框架,通过融合激光雷达和相机数据实现实时3D物体检测与轨迹预测,在有限计算资源下展现出卓越性能与部署能力。
English: This paper introduces a lightweight multi-modal framework that integrates LiDAR and camera data for real-time 3D object detection and trajectory prediction, achieving superior performance and deployability on limited computational resources.

Authors:Samuel Wertz, Arnaud Vandaele, Nicolas Gillis
Title: Efficient algorithms for the Hadamard decomposition
Abstract:
The Hadamard decomposition is a powerful technique for data analysis and matrix compression, which decomposes a given matrix into the element-wise product of two or more low-rank matrices. In this paper, we develop an efficient algorithm to solve this problem, leveraging an alternating optimization approach that decomposes the global non-convex problem into a series of convex sub-problems. To improve performance, we explore advanced initialization strategies inspired by the singular value decomposition (SVD) and incorporate acceleration techniques by introducing momentum-based updates. Beyond optimizing the two-matrix case, we also extend the Hadamard decomposition framework to support more than two low-rank matrices, enabling approximations with higher effective ranks while preserving computational efficiency. Finally, we conduct extensive experiments to compare our method with the existing gradient descent-based approaches for the Hadamard decomposition and with traditional low-rank approximation techniques. The results highlight the effectiveness of our proposed method across diverse datasets.
Chinese: 本文提出了一种高效的交替优化算法用于哈达玛分解,通过SVD启发的初始化和动量加速技术将其扩展至多矩阵分解,实验证明该方法在不同数据集上优于现有技术。
English: This paper introduces an efficient alternating optimization algorithm for Hadamard decomposition, extending it to multiple matrices with SVD-inspired initialization and momentum acceleration, demonstrating superior performance over existing methods in experiments.
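
The abstract's remark that element-wise products of low-rank matrices reach "higher effective ranks" can be illustrated with a small exact computation. The sketch below is not the authors' algorithm; it only checks, on hand-picked 4x4 matrices (an assumption chosen for illustration), that the Hadamard product of two rank-2 matrices can have rank 4. The helper names `matrix_rank` and `hadamard` are hypothetical.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Illustrative data (assumption): with t = x_i * y_j,
# m1 = 1 + t and m2 = 1 + t^2 are each a sum of two rank-1 matrices.
x = [1, 2, 3, 4]
y = [1, 2, 3, 4]
m1 = [[1 + xi * yj for yj in y] for xi in x]
m2 = [[1 + (xi * yj) ** 2 for yj in y] for xi in x]

# Their product is 1 + t + t^2 + t^3, a product of two Vandermonde matrices.
product = hadamard(m1, m2)
print(matrix_rank(m1), matrix_rank(m2), matrix_rank(product))  # prints: 2 2 4
```

In general the rank of a Hadamard product is bounded by the product of the factor ranks, which is why two rank-r factors can represent matrices of rank up to r squared while storing only the factors.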

Authors:Rohan P. Singh, Mitsuharu Morisawa, Mehdi Benallegue, Zhaoming Xie, Fumio Kanehiro
Title: Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning
Abstract:
For the deployment of legged robots in real-world environments, it is essential to develop robust locomotion control methods for challenging terrains that may exhibit unexpected deformability and irregularity. In this paper, we explore the application of sim-to-real deep reinforcement learning (RL) for the design of bipedal locomotion controllers for humanoid robots on compliant and uneven terrains. Our key contribution is to show that a simple training curriculum for exposing the RL agent to randomized terrains in simulation can achieve robust walking on a real humanoid robot using only proprioceptive feedback. We train an end-to-end bipedal locomotion policy using the proposed approach, and show extensive real-robot demonstration on the HRP-5P humanoid over several difficult terrains inside and outside the lab environment. Further, we argue that the robustness of a bipedal walking policy can be improved if the robot is allowed to exhibit aperiodic motion with variable stepping frequency. We propose a new control policy to enable modification of the observed clock signal, leading to adaptive gait frequencies depending on the terrain and command velocity. Through simulation experiments, we show the effectiveness of this policy specifically for walking over challenging terrains by controlling swing and stance durations. The code for training and evaluation is available online at https://github.com/rohanpsingh/LearningHumanoidWalking. Demo video is available at https://www.youtube.com/watch?v=ZgfNzGAkk2Q.
Chinese: 本文提出一种从仿真到现实的深度强化学习方法,通过简单训练课程和自适应步态控制,实现了人形机器人在复杂地形上的稳健双足行走,并在HRP-5P机器人上完成了实际环境验证。
English: This paper presents a sim-to-real deep reinforcement learning approach that enables robust bipedal locomotion on challenging terrains through a simple training curriculum and adaptive gait control, with successful real-world demonstrations on the HRP-5P humanoid robot.

Authors:Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen
Title: Compile Scene Graphs with Reinforcement Learning
Abstract:
Next-token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall -- which evaluate semantic and spatial alignment between predictions and ground truth at the object and relation levels. A format consistency reward further ensures that outputs follow the expected structural schema. Extensive experiments on the VG150 and PSG benchmarks show that R1-SGG substantially reduces failure rates and achieves strong performance in Recall and mean Recall, surpassing traditional SGG models and existing multimodal language models. Our code is available at https://github.com/gpt4vision/R1-SGG
中文摘要:R1-SGG是一种通过监督微调和基于图结构奖励的强化学习训练的多模态大模型,能够端到端生成结构化场景图,在视觉基准测试中显著优于传统场景图生成模型。
English Summary: R1-SGG is a multimodal language model trained with supervised fine-tuning and reinforcement learning using graph-centric rewards to generate structured scene graphs end-to-end, achieving superior performance on visual benchmarks compared to traditional methods.
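
As a rough illustration of what a recall-based, graph-centric reward and a format reward might look like, the sketch below scores predicted relationship triplets by exact label match and checks that the output parses as JSON. It is a simplified assumption, not the paper's Hard Recall definition (which the abstract says also considers spatial alignment); the function names and the `objects`/`relationships` keys are hypothetical.

```python
import json


def triplet_recall(predicted, ground_truth):
    """Fraction of ground-truth (subject, predicate, object) triplets
    that appear verbatim among the predictions."""
    gt = set(ground_truth)
    if not gt:
        return 0.0
    return len(gt & set(predicted)) / len(gt)


def format_reward(text):
    """1.0 if the text is a JSON object with the expected keys, else 0.0.
    The key names are an assumed schema, used only for illustration."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    if isinstance(data, dict) and {"objects", "relationships"} <= data.keys():
        return 1.0
    return 0.0


ground_truth = [
    ("person", "riding", "horse"),
    ("horse", "on", "grass"),
    ("person", "wearing", "hat"),
]
predicted = [
    ("person", "riding", "horse"),
    ("person", "wearing", "hat"),
    ("dog", "on", "grass"),
]
print(triplet_recall(predicted, ground_truth))
print(format_reward('{"objects": [], "relationships": []}'))
```

A reward of this shape gives the policy credit for each recovered relation while the format term penalises outputs that cannot be parsed at all.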

Authors:Ritwik Mishra, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Title: Long-context Non-factoid Question Answering in Indic Languages
Abstract:
Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at https://github.com/ritwikmishra/IndicGenQA.
中文: 本研究证明,通过语境缩短技术可显著提升印度语言问答系统的性能,不仅提高了语义和词元级评分并降低计算开销,但也揭示了大语言模型在处理非事实性推理问题时的局限性。
English: This study demonstrates that context-shortening techniques significantly enhance question answering performance for Indic languages by improving semantic and token-level scores while reducing computational costs, though limitations remain with non-factoid questions.

Authors:Zahra Akhlaghi, Mostafa Haghir Chehreghani
Title: Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation
Abstract:
The rapid growth of the internet has made personalized recommendation systems indispensable. Graph-based sequential recommendation systems, powered by Graph Neural Networks (GNNs), effectively capture complex user-item interactions but often face challenges such as noise and static representations. In this paper, we introduce the Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation (ALDA4Rec) method, a novel model that constructs an item-item graph, filters noise through community detection, and enriches user-item interactions. Graph Convolutional Networks (GCNs) are then employed to learn short-term representations, while averaging, GRUs, and attention mechanisms are utilized to model long-term embeddings. An MLP-based adaptive weighting strategy is further incorporated to dynamically optimize long-term user preferences. Experiments conducted on four real-world datasets demonstrate that ALDA4Rec outperforms state-of-the-art baselines, delivering notable improvements in both accuracy and robustness. The source code is available at https://github.com/zahraakhlaghi/ALDA4Rec.
中文:ALDA4Rec模型通过构建项目关系图、社区检测降噪,并结合图卷积网络与自适应加权策略动态优化长期用户偏好,在多个真实数据集上实现了精度与鲁棒性的显著提升。
English: The ALDA4Rec model enhances recommendation systems by constructing item-item graphs, filtering noise via community detection, and using GCNs with adaptive weighting to dynamically optimize long-term user preferences, achieving superior accuracy and robustness across multiple datasets.

Authors:Yuhao Liu, Teng Fu, Jie Fan, Panpan Niu, Chaowen Deng, Zhongyi Huang
Title: Capacity-achieving sparse superposition codes with spatially coupled VAMP decoder
Abstract:
Sparse superposition (SS) codes provide an efficient communication scheme over the Gaussian channel, utilizing the vector approximate message passing (VAMP) decoder for rotationally invariant design matrices. Previous work has established that the VAMP decoder for SS achieves Shannon capacity when the design matrix satisfies a specific spectral criterion and exponential decay power allocation is used. In this work, we propose a spatially coupled VAMP (SC-VAMP) decoder for SS with spatially coupled design matrices. Based on state evolution (SE) analysis, we demonstrate that the SC-VAMP decoder is capacity-achieving when the design matrices satisfy the spectral criterion. Empirically, we show that the SC-VAMP decoder outperforms the VAMP decoder with exponential decay power allocation, achieving a lower section error rate. All codes are available at https://github.com/yztfu/SC-VAMP-for-Superposition-Code.git.
Chinese: 针对具有空间耦合设计矩阵的稀疏叠加码,SC-VAMP解码器在满足频谱准则时可达香农容量,并通过降低分段错误率实证优于采用指数衰减功率分配的VAMP解码器。
English: The SC-VAMP decoder for sparse superposition codes with spatially coupled design matrices achieves Shannon capacity under spectral criteria and outperforms the VAMP decoder with exponential decay power allocation by reducing section error rates.
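
For readers unfamiliar with the "exponential decay power allocation" used by the baseline, the sketch below computes the form commonly given in the sparse superposition code literature, where section l of L receives power proportional to 2^(-2Cl/L) with C = 0.5 log2(1 + snr). Whether the paper uses exactly this parameterisation is an assumption; the function name and the example values are hypothetical.

```python
import math


def exponential_power_allocation(num_sections, total_power, snr):
    """Per-section powers P_l proportional to 2^(-2*C*l/L), l = 1..L,
    normalised so that they sum to total_power."""
    capacity = 0.5 * math.log2(1.0 + snr)  # Gaussian channel capacity, bits per use
    raw = [
        2.0 ** (-2.0 * capacity * l / num_sections)
        for l in range(1, num_sections + 1)
    ]
    scale = total_power / sum(raw)
    return [scale * w for w in raw]


# Example values (assumption): snr = 15 gives C = 2 bits, so with L = 4
# the raw weights are 1/2, 1/4, 1/8, 1/16.
powers = exponential_power_allocation(num_sections=4, total_power=15.0, snr=15.0)
print(powers)
```

Earlier sections get more power, so a message-passing decoder can resolve them first; the spatially coupled design described in the abstract is reported to outperform this baseline.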

Authors:Jun Zeng, KC Santosh, Deepak Rajan Nayak, Thomas de Lange, Jonas Varkey, Tyler Berzin, Debesh Jha
Title: FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention
Abstract:
Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM) for balancing local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced dataset with multi-modality and multi-center data for building more reliable segmentation methods. Extensive experiments show that FocusNet consistently outperforms existing state-of-the-art approaches, with high Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at https://github.com/JunZengz/FocusNet.
Chinese: FocusNet通过集成交叉语义交互解码器、细节增强模块和聚焦注意力模块,在多模态结肠镜数据集上实现了82.09%-93.42%的优异分割精度,显著提升了息肉分割的鲁棒性和临床适用性。
English: FocusNet, a Transformer-enhanced network with specialized modules, significantly improves polyp segmentation accuracy and robustness across multiple imaging modalities, outperforming existing methods with dice coefficients ranging from 82.09% to 93.42%.
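
The Dice coefficient quoted above measures overlap between a predicted mask and the reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for binary masks stored as nested lists follows; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for binary masks."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
print(dice_coefficient(pred_mask, true_mask))
```

Here one pixel is shared, the prediction has two foreground pixels and the reference has one, so the score is 2 * 1 / 3.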

Authors:Shuobin Wei, Zhuang Zhou, Zhengan Lu, Zizhao Yuan, Binghua Su
Title: HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework
Abstract:
In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Properly distinguishing the processing of RGB and depth images is essential to fully exploiting their unique and significant characteristics. To address this, we propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic and detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large kernel convolutions to interact global and local information across modalities efficiently. Extensive experiments show that HDBFormer achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. The code is available at: https://github.com/Weishuobin/HDBFormer.
中文: HDBFormer提出了一种异构双分支框架,通过分别处理RGB和深度图像的特性,并利用模态信息交互模块,在室内场景语义分割中实现了领先的性能。
English: HDBFormer introduces a heterogeneous dual-branch framework with specialized encoders for RGB and depth modalities, enhanced by a Modality Information Interaction Module to achieve state-of-the-art indoor scene segmentation performance.

Authors:Yang Wu, Yun Zhu, Kaihua Zhang, Jianjun Qian, Jin Xie, Jian Yang
Title: WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion
Abstract:
3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single adverse weather with a single physical model, and the fidelity of the generated data is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which can provide a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Within this model, we propose a spider mamba generator to gradually restore the disturbed diverse-weather data. The spider mamba models the feature interactions by scanning the LiDAR beam circle or central ray, maintaining the physical structure of the LiDAR data. Following the generator, we design a latent feature aligner to transfer real-world knowledge. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision, guiding the diffusion model to generate more discriminative data. Extensive evaluations demonstrate the high generation quality of WeatherGen. Through WeatherGen, we construct the mini-weather dataset, promoting the performance of the downstream task under adverse weather conditions. Code is available at https://github.com/wuyang98/weathergen.
中文:WeatherGen是一个统一的扩散框架,通过蜘蛛曼巴生成器和对比学习生成高保真、多天气的激光雷达数据,显著提升了恶劣天气下的三维感知性能。
English: WeatherGen is a unified diffusion framework that generates high-fidelity, diverse-weather LiDAR data using a spider mamba generator and contrastive learning, significantly improving 3D perception in adverse conditions.

Authors:Yihao Ouyang, Xunheng Kuang, Mengjia Xiong, Zhida Wang, Yuanquan Wang
Title: A Novel Hybrid Approach for Retinal Vessel Segmentation with Dynamic Long-Range Dependency and Multi-Scale Retinal Edge Fusion Enhancement
Abstract:
Accurate retinal vessel segmentation provides essential structural information for ophthalmic image analysis. However, existing methods struggle with challenges such as multi-scale vessel variability, complex curvatures, and ambiguous boundaries. While Convolutional Neural Networks (CNNs), Transformer-based models and Mamba-based architectures have advanced the field, they often suffer from vascular discontinuities or edge feature ambiguity. To address these limitations, we propose a novel hybrid framework that synergistically integrates CNNs and Mamba for high-precision retinal vessel segmentation. Our approach introduces three key innovations: 1) The proposed High-Resolution Edge Fuse Network is a high-resolution preserving hybrid segmentation framework that combines a multi-scale backbone with the Multi-scale Retina Edge Fusion (MREF) module to enhance edge features, ensuring accurate and robust vessel segmentation. 2) The Dynamic Snake Visual State Space block combines Dynamic Snake Convolution with Mamba to adaptively capture vessel curvature details and long-range dependencies. An improved eight-directional 2D Snake-Selective Scan mechanism and a dynamic weighting strategy enhance the perception of complex vascular topologies. 3) The MREF module enhances boundary precision through multi-scale edge feature aggregation, suppressing noise while emphasizing critical vessel structures across scales. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance, particularly in maintaining vascular continuity and effectively segmenting vessels in low-contrast regions. This work provides a robust method for clinical applications requiring accurate retinal vessel analysis. The code is available at https://github.com/frank-oy/HREFNet.
中文: 本文提出了一种结合CNN和Mamba的新型混合框架,通过三项关键创新解决了视网膜血管分割中的多尺度变化和边界模糊等难题,在保持血管连续性和低对比度区域分割方面实现了最先进的性能。
English: This paper introduces a novel hybrid framework combining CNNs and Mamba to overcome retinal vessel segmentation challenges like multi-scale variability and ambiguous boundaries, achieving state-of-the-art performance through three key innovations that enhance edge features and vascular continuity.

Authors:Haoyang Luo, Linwei Tao, Minjing Dong, Chang Xu
Title: Beyond One-Hot Labels: Semantic Mixing for Model Calibration
Abstract:
Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels, implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provide insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available as real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present Calibration-aware Semantic Mixing (CSM), a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. In addition, we explore loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Our code is available at https://github.com/E-Galois/CSM.
中文摘要:本文提出了一种校准感知的数据增强方法CSM,通过扩散模型生成具有混合类别特征的合成样本并标注精确置信度,从而超越传统独热编码数据集的局限,显著提升了模型校准性能。
English Summary: This paper introduces a calibration-aware data augmentation method called CSM, which generates synthetic datasets with mixed-class samples and precise confidence scores using diffusion models to improve model calibration beyond traditional one-hot labeled datasets.

Authors:Jinhao Li, Zijian Chen, Tingzhu Chen, Zhiji Liu, Changbo Wang
Title: OBIFormer: A Fast Attentive Denoising Framework for Oracle Bone Inscriptions
Abstract:
Oracle bone inscriptions (OBIs) are the earliest known form of Chinese characters and serve as a valuable resource for research in anthropology and archaeology. However, most excavated fragments are severely degraded due to thousands of years of natural weathering, corrosion, and man-made destruction, making automatic OBI recognition extremely challenging. Previous methods either focus on pixel-level information or utilize vanilla transformers for glyph-based OBI denoising, which leads to tremendous computational overhead. Therefore, this paper proposes a fast attentive denoising framework for oracle bone inscriptions, i.e., OBIFormer. It leverages channel-wise self-attention, glyph extraction, and selective kernel feature fusion to reconstruct denoised images precisely while being computationally efficient. Our OBIFormer achieves state-of-the-art denoising performance for PSNR and SSIM metrics on synthetic and original OBI datasets. Furthermore, comprehensive experiments on a real oracle dataset demonstrate the great potential of our OBIFormer in assisting automatic OBI recognition. The code will be made available at https://github.com/LJHolyGround/OBIFormer.
中文摘要:本文提出OBIFormer这一快速注意力去噪框架,通过通道自注意力与字形提取技术高效修复受损甲骨文,在保持计算效率的同时实现了最优去噪性能。
English Summary: This paper introduces OBIFormer, a fast attentive denoising framework that effectively restores degraded oracle bone inscriptions using channel-wise self-attention and glyph extraction, achieving state-of-the-art performance while maintaining computational efficiency.
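
Since the denoising results above are reported in PSNR, a short sketch of how that metric is computed may help. It uses the standard definition, 10 log10(MAX^2 / MSE), on images stored as nested lists; the function name and the 8-bit default for `max_value` are assumptions for illustration and say nothing about the authors' evaluation code.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [[0, 0], [0, 0]]
noisy = [[0, 0], [0, 255]]
# One of four pixels is off by the full range, so MSE = 255^2 / 4.
print(psnr(clean, noisy))
```

Because the scale is logarithmic, every 10 dB gain corresponds to a tenfold reduction in mean squared error.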

Authors:Yipeng Sun, Linda-Sophie Schneider, Mingxuan Gu, Siyuan Mei, Chengze Ye, Fabian Wagner, Siming Bayer, Andreas Maier
Title: Filter2Noise: Interpretable Self-Supervised Single-Image Denoising for Low-Dose CT with Attention-Guided Bilateral Filtering
Abstract:
Effective denoising is crucial in low-dose CT to enhance subtle structures and low-contrast lesions while preventing diagnostic errors. Supervised methods struggle with limited paired datasets, and self-supervised approaches often require multiple noisy images and rely on deep networks like U-Net, offering little insight into the denoising mechanism. To address these challenges, we propose an interpretable self-supervised single-image denoising framework -- Filter2Noise (F2N). Our approach introduces an Attention-Guided Bilateral Filter that adapts to each noisy input through a lightweight module that predicts spatially varying filter parameters, which can be visualized and adjusted post-training for user-controlled denoising in specific regions of interest. To enable single-image training, we introduce a novel downsampling shuffle strategy with a new self-supervised loss function that extends the concept of Noise2Noise to a single image and addresses spatially correlated noise. On the Mayo Clinic 2016 low-dose CT dataset, F2N outperforms the leading self-supervised single-image method (ZS-N2N) by 4.59 dB PSNR while improving transparency, user control, and parametric efficiency. These features provide key advantages for medical applications that require precise and interpretable noise reduction. Our code is available at https://github.com/sypsyp97/Filter2Noise.git.
中文摘要:提出的Filter2Noise框架通过引入可调节参数的注意力引导双边滤波器,实现了可解释的单图像自监督CT去噪,在性能和透明度方面均优于现有方法。
English Summary: The proposed Filter2Noise framework introduces an interpretable self-supervised method for single-image CT denoising using an attention-guided bilateral filter with adjustable parameters, achieving superior performance and transparency compared to existing approaches.
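
To make the role of the filter parameters concrete, here is a sketch of a classical bilateral filter on a 1D signal. It uses fixed, global `sigma_spatial` and `sigma_range` values, whereas the abstract describes a module that predicts spatially varying parameters; this is therefore a plain baseline under stated assumptions, not the F2N method, and all names are hypothetical.

```python
import math


def bilateral_filter_1d(signal, sigma_spatial, sigma_range, radius):
    """Weighted average where each neighbour's weight decays with both
    its distance in position and its difference in intensity."""
    n = len(signal)
    out = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w_spatial = math.exp(-((i - j) ** 2) / (2.0 * sigma_spatial ** 2))
            w_range = math.exp(
                -((signal[i] - signal[j]) ** 2) / (2.0 * sigma_range ** 2)
            )
            num += w_spatial * w_range * signal[j]
            den += w_spatial * w_range
        out.append(num / den)  # den >= 1 because j == i contributes weight 1
    return out


# A sharp step: with a small sigma_range, samples across the edge get
# almost no weight, so the edge survives the smoothing.
step = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
print(bilateral_filter_1d(step, sigma_spatial=1.0, sigma_range=1.0, radius=2))
```

Raising `sigma_range` makes the filter behave more like a Gaussian blur, which is the kind of trade-off a per-pixel parameter prediction could expose to the user.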

Authors:Jianing Wang, Jin Jiang, Yang Liu, Mengdi Zhang, Xunliang Cai
Title: Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
Abstract:
In this paper, we introduce a new process prejudge strategy in LLM reasoning and demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors it may encounter in subsequent reasoning steps, much as people sometimes pause to think about what mistakes may occur and how to avoid them rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step, with at least one step that follows the prejudge node that has no paths toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data are released at https://github.com/wjn1996/Prejudge-Before-Think.
中文: 本文提出了一种LLM推理中的"过程预判"策略,通过动态树搜索的自动化框架和两阶段训练机制,使模型能够在思考前预判潜在错误,显著提升了复杂推理能力。
English: This paper proposes a "process prejudge" strategy for LLMs that enables adaptive error anticipation during reasoning, implemented through an automated framework with dynamic tree-searching and a two-phase training mechanism, significantly boosting complex reasoning performance.

Authors:Alex Ergasti, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Title: U-Shape Mamba: State Space Model for faster diffusion
Abstract:
Diffusion models have become the most popular approach for high-quality image generation, but their high computational cost still remains a significant challenge. To address this problem, we propose U-Shape Mamba (USM), a novel diffusion model that leverages Mamba-based layers within a U-Net-like hierarchical structure. By progressively reducing sequence length in the encoder and restoring it in the decoder through Mamba blocks, USM significantly lowers computational overhead while maintaining strong generative capabilities. Experimental results against Zigma, which is currently the most efficient Mamba-based diffusion model, demonstrate that USM achieves one-third the GFlops, requires less memory and is faster, while outperforming Zigma in image quality. Frechet Inception Distance (FID) is improved by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO datasets, respectively. These findings highlight USM as a highly efficient and scalable solution for diffusion-based generative models, making high-quality image synthesis more accessible to the research community while reducing computational costs.
Chinese: USM是一种新型扩散模型,采用类似U-Net结构和Mamba层,在显著降低三分之二计算量、减少内存需求并加快速度的同时,图像质量超越了当前最高效的Mamba模型Zigma。
English: USM is a novel diffusion model that uses a U-Net-like structure with Mamba layers to cut computational cost to roughly one-third of Zigma's GFLOPs, reduce memory usage, and speed up processing, while improving image quality over Zigma, currently the most efficient Mamba-based diffusion model.

Authors:Ziqi Zhao, Zhaochun Ren, Jiyuan Yang, Zuming Yan, Zihan Wang, Liu Yang, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Xin Xin
Title: Improving Sequential Recommenders through Counterfactual Augmentation of System Exposure
Abstract:
In sequential recommendation (SR), system exposure refers to items that are exposed to the user. Typically, only a few of the exposed items would be interacted with by the user. Although SR has achieved great success in predicting future user interests, existing SR methods still fail to fully exploit system exposure data. Most methods only model items that have been interacted with, while the large volume of exposed but non-interacted items is overlooked. Even methods that consider the whole system exposure typically train the recommender using only the logged historical system exposure, without exploring unseen user interests. In this paper, we propose counterfactual augmentation over system exposure for sequential recommendation (CaseRec). To better model historical system exposure, CaseRec introduces reinforcement learning to account for different exposure rewards. CaseRec uses a decision transformer-based sequential model to take an exposure sequence as input and assigns different rewards according to the user feedback. To further explore unseen user interests, CaseRec proposes to perform counterfactual augmentation, where exposed original items are replaced with counterfactual items. Then, a transformer-based user simulator is proposed to predict the user feedback reward for the augmented items. Augmentation, together with the user simulator, constructs counterfactual exposure sequences to uncover new user interests. Finally, CaseRec jointly uses the logged exposure sequences with the counterfactual exposure sequences to train a decision transformer-based sequential model for generating recommendation. Experiments on three real-world benchmarks show the effectiveness of CaseRec. Our code is available at https://github.com/ZiqiZhao1/CaseRec.
中文: 本文提出CaseRec序列推荐方法,通过强化学习和反事实增强技术,结合决策变换器和用户模拟器来探索未观察到的用户兴趣,从而提升系统曝光数据的利用效率和推荐效果。
English: This paper introduces CaseRec, a sequential recommendation method that enhances system exposure modeling through reinforcement learning and counterfactual augmentation, using decision transformers and a user simulator to explore unseen user interests and improve recommendation performance.

Authors:Chenwei Yan, Xiangling Fu, Yuxuan Xiong, Tianyi Wang, Siu Cheung Hui, Ji Wu, Xien Liu
Title: LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Abstract:
Large language models (LLMs) have demonstrated impressive performance across various domains. However, clinical diagnosis places higher demands on LLMs' reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, namely GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
中文: 当前大语言模型在临床诊断中对关键医学信息的敏感性存在不足,需提升其可靠性和关键信息感知能力,以增强人类信任并促进实际应用。
English: Current large language models exhibit limitations in maintaining sensitivity to crucial medical information for clinical diagnosis, necessitating improvements in reliability and key information awareness to enhance trust and practical application.

Authors:Wang Liu, Zhiyu Wang, Xin Guo, Puhong Duan, Xudong Kang, Shutao Li
Title: Learning from Noisy Pseudo-labels for All-Weather Land Cover Mapping
Abstract:
Semantic segmentation of SAR images has garnered significant attention in remote sensing due to the immunity of SAR sensors to cloudy weather and light conditions. Nevertheless, SAR imagery lacks detailed information and is plagued by significant speckle noise, rendering the annotation or segmentation of SAR images a formidable task. Recent efforts have resorted to annotating paired optical-SAR images to generate pseudo-labels through the utilization of an optical image segmentation network. However, these pseudo-labels are laden with noise, leading to suboptimal performance in SAR image segmentation. In this study, we introduce a more precise method for generating pseudo-labels by incorporating semi-supervised learning alongside a novel image resolution alignment augmentation. Furthermore, we introduce a symmetric cross-entropy loss to mitigate the impact of noisy pseudo-labels. Additionally, a bag of training and testing tricks is utilized to generate better land-cover mapping results. Our experiments on the GRSS data fusion contest indicate the effectiveness of the proposed method, which achieves first place. The code is available at https://github.com/StuLiu/DFC2025Track1.git.
中文摘要:本研究提出了一种结合半监督学习和图像分辨率对齐的改进方法,用于生成SAR图像分割的伪标签,并在GRSS数据融合竞赛中取得了最优成绩。
English Summary: This study introduces an enhanced method for generating pseudo-labels for SAR image segmentation using semi-supervised learning and image resolution alignment, achieving top performance in the GRSS data fusion contest.
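
The abstract names a symmetric cross-entropy loss but does not give its form. The sketch below uses the form commonly cited in the noisy-label literature: standard cross-entropy plus a reverse cross-entropy term in which log 0 is clipped to a constant. The weights `alpha` and `beta` and the clipping value `log_zero` are illustrative assumptions, not the authors' settings.

```python
import math
from typing import Sequence


def symmetric_cross_entropy(
    probs: Sequence[float],
    label: int,
    alpha: float = 1.0,     # weight of the standard CE term (assumed)
    beta: float = 1.0,      # weight of the reverse CE term (assumed)
    log_zero: float = -4.0  # stand-in for log(0) in the reverse term (assumed)
) -> float:
    """Per-pixel symmetric cross-entropy for a one-hot (possibly noisy) label."""
    eps = 1e-12
    # Standard CE: -log of the probability assigned to the labeled class.
    ce = -math.log(max(probs[label], eps))
    # Reverse CE: -sum_k p_k * log(q_k), where q is the one-hot label.
    # log(q_k) is 0 for the labeled class and log_zero for every other class.
    rce = -log_zero * sum(p for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce
```

The reverse term is bounded (at most `-log_zero`), so a confidently wrong pseudo-label contributes a limited penalty instead of the unbounded one that plain cross-entropy would give.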

Authors:Saksham Rastogi, Pratyush Maini, Danish Pruthi
Title: STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Abstract:
Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and utility of the original data. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
中文摘要:STAMP框架通过生成带独特水印的多个改写版本,并比较公开与私有版本间的模型似然度,使数据创建者能够检测其内容是否被用于大型语言模型的预训练数据中。
English Summary: The STAMP framework enables data creators to detect if their content was used in training large language models by watermarking multiple rephrased versions and comparing model likelihoods between public and private copies.
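
The comparison step can be illustrated with a paired t statistic over per-document log-likelihoods. This is a sketch of one possible paired test, not necessarily the one STAMP uses. It assumes `public_ll[i]` is the model's log-likelihood of the public rephrasing of document i and `private_ll[i]` is the average over its private rephrasings; both names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(public_ll: Sequence[float], private_ll: Sequence[float]) -> float:
    """t statistic for the mean of paired differences (public minus private)."""
    if len(public_ll) != len(private_ll) or len(public_ll) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [p - q for p, q in zip(public_ll, private_ll)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t statistic is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

A large positive statistic, compared against a t distribution with n - 1 degrees of freedom, indicates that the model systematically prefers the public versions, which is the signal one would expect if those versions were in the training data.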

Authors:Shimou Ling, Liang Zhang, Jiangwei Zhao, Lili Pan, Hongliang Li
Title: LoRA-Based Continual Learning with Constraints on Critical Parameter Changes
Abstract:
LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we directly propose freezing the most critical parameter matrices in the Vision Transformer (ViT) for pre-tasks before learning post-tasks. In addition, building on orthogonal LoRA tuning, we propose orthogonal LoRA composition (LoRAC) based on QR decomposition, which may further enhance the plasticity of our method. Elaborate ablation studies and extensive comparisons demonstrate the effectiveness of our proposed method. Our results indicate that our method achieves state-of-the-art (SOTA) performance on several well-known continual learning benchmarks. For instance, on the Split CIFAR-100 dataset, our method shows a 6.35\% improvement in accuracy and a 3.24\% reduction in forgetting compared to previous methods. Our code is available at https://github.com/learninginvision/LoRAC-IPC.
Chinese: 本文提出一种在视觉Transformer中冻结关键参数矩阵并结合正交LoRA组合的方法,有效提升了持续学习性能,在Split CIFAR-100等基准测试中实现了最优的准确率提升和遗忘减少。
English: This paper introduces a method that freezes key parameter matrices in Vision Transformers and employs orthogonal LoRA composition to enhance continual learning, achieving state-of-the-art performance with significant improvements in accuracy and reduced forgetting on benchmarks like Split CIFAR-100.

Authors:Yeongjun Jang, Joowon Lee, Junsoo Kim
Title: Documentation on Encrypted Dynamic Control Simulation Code using Ring-LWE based Cryptosystems
Abstract:
Encrypted controllers offer secure computation by employing modern cryptosystems to execute control operations directly over encrypted data without decryption. However, incorporating cryptosystems into dynamic controllers significantly increases the computational load. This paper aims to provide an accessible guideline for running encrypted controllers using the open-source library Lattigo, which supports an efficient implementation of Ring-Learning With Errors (Ring-LWE) based encrypted controllers. Our explanations are accompanied by example code that is fully available at https://github.com/CDSL-EncryptedControl/CDSL.
中文摘要:本文通过开源Lattigo库提供了运行加密控制器的实用指南,该库支持基于环LWE的高效加密控制器实现,并附带完整可用的示例代码。
English Summary: This paper provides a practical guide for implementing encrypted controllers using the open-source Lattigo library, which enables efficient Ring-LWE-based secure computation while addressing increased computational demands.

Authors:Shashank Shriram, Srinivasa Perisetla, Aryan Keskar, Harsha Krishnaswamy, Tonko Emil Westerhof Bossen, Andreas Møgelmose, Ross Greer
Title: Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety
Abstract:
Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline consists of a Vision-Language Model (VLM) and a Large Language Model (LLM), which together detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of hazard detection and labeling evaluation on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
中文摘要:本文提出了一种融合视觉语言推理与零样本目标检测的多模态方法,通过突破预定义类别的限制并提升定位精度,来改进自动驾驶中的危险识别能力。
English Summary: This paper introduces a multimodal system combining vision-language reasoning and zero-shot object detection to enhance autonomous driving hazard identification by overcoming limitations of predefined categories and improving localization accuracy.
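
The cosine-similarity evaluation mentioned in the abstract can be sketched as below. The abstract does not say which text encoder produces the embeddings, so the vectors are treated as given inputs and the function name is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

In this setting, `a` would be the embedding of the predicted hazard description and `b` the embedding of the annotated ground truth for the same video; a score near 1 means the two descriptions are semantically close under the chosen encoder.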

Authors:Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou
Title: Cost-of-Pass: An Economic Framework for Evaluating Language Models
Abstract:
The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert", using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
中文摘要:本研究提出基于“通过成本”的经济评估框架,发现轻量级、大型和推理模型分别在基础定量、知识密集和复杂定量任务中具有最优成本效益,且模型层面的创新是推动成本效率提升的关键因素。
English Summary: The study introduces an economic framework using "cost-of-pass" metrics to evaluate AI systems, revealing that lightweight, large, and reasoning models each excel in specific tasks, with model-level innovations being the main driver of cost-efficiency improvements.
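
The two definitions in the abstract can be read as a short computation. The sketch below assumes independent attempts, so the expected cost of one correct solution is the cost per attempt divided by the success rate. The model names and numbers in the usage note are made up for illustration.

```python
from typing import Dict, Tuple


def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected monetary cost of obtaining one correct solution.

    Assumes attempts are independent, so 1 / success_rate attempts are
    needed on average. A model that never succeeds has infinite cost.
    """
    if success_rate <= 0.0:
        return float("inf")
    return cost_per_attempt / success_rate


def frontier_cost_of_pass(
    models: Dict[str, Tuple[float, float]],  # name -> (cost per attempt, success rate)
    human_expert_cost: float,                # approximate cost of hiring an expert
) -> float:
    """Minimum cost-of-pass over all models and the human-expert baseline."""
    candidates = [cost_of_pass(cost, rate) for cost, rate in models.values()]
    return min(candidates + [human_expert_cost])
```

For example, with a hypothetical `{"small": (0.01, 0.1), "large": (0.05, 0.8)}` and an expert cost of 10.0, the small model costs about 0.10 per correct answer and the large one about 0.0625, so the frontier sits at the large model even though each of its attempts is five times more expensive.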

Authors:Omar Tsai, Jianing Li, Tsz Tung Cheung, Lejing Huang, Hao Zhu, Jianrui Xiao, Iman Sharafaldin, Mohammad A. Tayebi
Title: GraphQLer: Enhancing GraphQL Security with Context-Aware API Testing
Abstract:
GraphQL is an open-source data query and manipulation language for web applications, offering a flexible alternative to RESTful APIs. However, its dynamic execution model and lack of built-in security mechanisms expose it to vulnerabilities such as unauthorized data access, denial-of-service (DoS) attacks, and injections. Existing testing tools focus on functional correctness, often overlooking security risks stemming from query interdependencies and execution context. This paper presents GraphQLer, the first context-aware security testing framework for GraphQL APIs. GraphQLer constructs a dependency graph to analyze relationships among mutations, queries, and objects, capturing critical interdependencies. It chains related queries and mutations to reveal authentication and authorization flaws, access control bypasses, and resource misuse. Additionally, GraphQLer tracks internal resource usage to uncover data leakage, privilege escalation, and replay attack vectors. We assess GraphQLer on various GraphQL APIs, demonstrating improved testing coverage - averaging a 35% increase, with up to 84% in some cases - compared to top-performing baselines. Remarkably, this is achieved in less time, making GraphQLer suitable for time-sensitive contexts. GraphQLer also successfully detects a known CVE and potential vulnerabilities in large-scale production APIs. These results underline GraphQLer's utility in proactively securing GraphQL APIs through automated, context-aware vulnerability detection.
中文: GraphQLer是首个上下文感知的GraphQL安全测试框架,通过构建依赖图分析查询关联性,能高效发现权限漏洞和数据泄露问题,显著提升检测覆盖率。
English: GraphQLer is a novel context-aware security testing framework that enhances GraphQL API protection by analyzing query interdependencies and detecting vulnerabilities like unauthorized access and data leakage with improved efficiency and coverage.
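
The idea of chaining queries and mutations through a dependency graph can be sketched as follows. This is an illustration only, not GraphQLer's algorithm: each operation is assumed to declare which object types it produces and consumes, and the operation names (`createUser`, `createPost`, `getPost`) are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# operation name -> {"produces": object types, "consumes": object types}
Operations = Dict[str, Dict[str, Set[str]]]


def build_dependency_graph(operations: Operations) -> Dict[str, Set[str]]:
    """Map each operation to the operations that must run before it."""
    producers: Dict[str, Set[str]] = {}
    for name, spec in operations.items():
        for obj in spec["produces"]:
            producers.setdefault(obj, set()).add(name)
    graph: Dict[str, Set[str]] = {}
    for name, spec in operations.items():
        deps: Set[str] = set()
        for obj in spec["consumes"]:
            deps |= producers.get(obj, set())
        deps.discard(name)  # an operation does not depend on itself
        graph[name] = deps
    return graph


def chain_operations(operations: Operations) -> List[str]:
    """Order operations so that every consumer runs after its producers."""
    return list(TopologicalSorter(build_dependency_graph(operations)).static_order())
```

Given an operation that creates a `User`, one that creates a `Post` from a `User`, and one that reads a `Post`, `chain_operations` returns them in that order, so a test run can feed real object identifiers from earlier responses into later requests. `TopologicalSorter` raises `CycleError` if the declared dependencies are circular.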

Authors:Jiang-Xin Shi, Tong Wei, Yu-Feng Li
Title: LIFT+: Lightweight Fine-Tuning for Long-Tail Learning
Abstract:
The fine-tuning paradigm has emerged as a prominent approach for addressing long-tail learning tasks in the era of foundation models. However, the impact of fine-tuning strategies on long-tail learning performance remains unexplored. In this work, we disclose that existing paradigms exhibit a profound misuse of fine-tuning methods, leaving significant room for improvement in both efficiency and accuracy. Specifically, we reveal that heavy fine-tuning (fine-tuning a large proportion of model parameters) can lead to non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning demonstrates superior effectiveness. Through comprehensive theoretical and empirical validation, we identify this phenomenon as stemming from inconsistent class conditional distributions induced by heavy fine-tuning. Building on this insight, we propose LIFT+, an innovative lightweight fine-tuning framework to optimize consistent class conditions. Furthermore, LIFT+ incorporates semantic-aware initialization, minimalist data augmentation, and test-time ensembling to enhance adaptation and generalization of foundation models. Our framework provides an efficient and accurate pipeline that facilitates fast convergence and model compactness. Extensive experiments demonstrate that LIFT+ significantly reduces both training epochs (from $\sim$100 to $\leq$15) and learned parameters (less than 1%), while surpassing state-of-the-art approaches by a considerable margin. The source code is available at https://github.com/shijxcs/LIFT-plus.
中文: 该研究揭示过度微调会损害尾部类别性能,并提出LIFT+轻量框架,通过大幅减少训练轮次和参数,显著提升模型效率与精度。
English: The study reveals that heavy fine-tuning impairs performance on tail classes, while proposing LIFT+, a lightweight framework that enhances efficiency and accuracy by reducing training epochs and parameters significantly.

Authors:Zichao Yue, Chenhui Deng, Zhiru Zhang
Title: Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
Abstract:
Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedup of up to 2 orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.
中文: 本文系统分析了预传播图神经网络的理论优势,发现数据加载和输入扩展是实际瓶颈,并提出优化方案使训练吞吐量比基线平均提升15倍,在大型图基准上比基于采样的方法快达两个数量级。
English: This paper characterizes pre-propagation GNNs (PP-GNNs) as theoretically addressing neighbor explosion but identifies data loading and input expansion as practical bottlenecks, proposing optimizations that improve training throughput by an average of 15× over PP-GNN baselines and achieve up to two orders of magnitude speedup over sampling-based GNNs on large graph benchmarks.
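
Decoupling propagation from training can be sketched in a few lines: the multi-hop neighbor averages are computed once as a preprocessing step, and training then sees only per-node feature rows. The row-normalized adjacency and the concatenation of hop features are assumptions made for illustration; individual PP-GNN models differ in both choices.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(adj: Matrix) -> Matrix:
    """Divide each row by its sum so that propagation averages over neighbors."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def pre_propagate(adj: Matrix, features: Matrix, hops: int) -> Matrix:
    """Return per-node rows [X, AX, A^2 X, ...] concatenated along the feature axis."""
    a_hat = row_normalize(adj)
    hop_feats = [features]
    for _ in range(hops):
        hop_feats.append(matmul(a_hat, hop_feats[-1]))
    # Concatenate the hop features of each node into one training row.
    return [sum((h[i] for h in hop_feats), []) for i in range(len(features))]
```

After this step no graph access is needed during training, which is why neighbor explosion disappears. The cost is that each node's input grows by a factor of `hops + 1`, which corresponds to the input expansion and data loading pressure the abstract identifies.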

Authors:Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
Title: DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Abstract:
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
Chinese: 本文提出的领域影响感知数据采样(DIDS)方法通过梯度聚类确保领域内一致性,并利用费舍尔信息矩阵度量领域影响,在实验中实现了3.4%的性能提升,同时保持训练效率。
English: The paper introduces Domain Impact-aware Data Sampling (DIDS), which optimizes domain sampling for large language models by using gradient clustering for intra-domain consistency and a Fisher Information Matrix metric to measure domain impact, achieving a 3.4% performance boost in experiments.

Authors:Wenhua Wu, Tong Zhao, Chensheng Peng, Lei Yang, Yintao Wei, Zhe Liu, Hesheng Wang
Title: BEV-GS: Feed-forward Gaussian Splatting in Bird's-Eye-View for Road Reconstruction
Abstract:
The road surface is the sole contact medium for wheels or robot feet. Reconstructing the road surface is crucial for unmanned vehicles and mobile robots. Recent studies on Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have achieved remarkable results in scene reconstruction. However, they typically rely on multi-view image inputs and require prolonged optimization times. In this paper, we propose BEV-GS, a real-time single-frame road surface reconstruction method based on feed-forward Gaussian splatting. BEV-GS consists of a prediction module and a rendering module. The prediction module introduces separate geometry and texture networks following the Bird's-Eye-View paradigm. Geometric and texture parameters are directly estimated from a single frame, avoiding per-scene optimization. In the rendering module, we utilize grid Gaussian for road surface representation and novel view synthesis, which better aligns with road surface characteristics. Our method achieves state-of-the-art performance on the real-world dataset RSRD. The road elevation error reduces to 1.73 cm, and the PSNR of novel view synthesis reaches 28.36 dB. The prediction and rendering speeds are 26 FPS and 2061 FPS, respectively, enabling high-accuracy and real-time applications. The code will be available at https://github.com/cat-wwh/BEV-GS.
中文摘要:本文提出BEV-GS方法,通过前馈高斯泼溅实现单帧路面实时重建,在真实数据集上达到1.73厘米高程误差的最佳性能,预测帧率达26 FPS,满足高精度实时应用需求。
English Summary: This paper introduces BEV-GS, a real-time single-frame road surface reconstruction method using feed-forward Gaussian splatting that achieves state-of-the-art accuracy with 1.73 cm elevation error and real-time performance at 26 FPS for prediction.

Authors:Liujianfu Wang, Xinyi Long, Yuyang Du, Xiaoyan Liu, Kexin Chen, Soung Chang Liew
Title: Cellular-X: An LLM-empowered Cellular Agent for Efficient Base Station Operations
Abstract:
This paper introduces Cellular-X, an LLM-powered agent designed to automate cellular base station (BS) maintenance. Leveraging multimodal LLM and retrieval-augmented generation (RAG) techniques, Cellular-X significantly enhances field engineer efficiency by quickly interpreting user intents, retrieving relevant technical information, and configuring a BS through iterative self-correction. Key features of the demo include automatic customized BS setup, document-based query answering, and voice-controlled configuration reporting and revision. We implemented Cellular-X on a USRP X310 testbed for demonstration. Demo videos and implementation details are available at https://github.com/SeaBreezing/Cellular-X.
Chinese: 本文介绍了Cellular-X,一种基于大语言模型的智能代理,通过多模态LLM和检索增强生成技术,自动执行基站维护任务,能快速解析用户意图、检索技术信息并进行自我修正配置,显著提升现场工程师效率。
English: This paper presents Cellular-X, an LLM-powered agent that automates cellular base station maintenance using multimodal LLM and RAG techniques to improve engineer efficiency through intent interpretation, technical retrieval, and self-correcting configuration.

Authors:Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
Title: Perception Encoder: The best visual embeddings are not at the output of the network
Abstract:
We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
中文: 感知编码器(PE)是一种通过对比性视觉-语言学习训练的最先进视觉编码器,借助对齐方法提取隐藏嵌入,在分类、问答和空间预测等多种任务中均取得了顶尖成果。
English: The Perception Encoder (PE) is a state-of-the-art vision encoder trained with contrastive vision-language learning, achieving top-tier results across diverse tasks like classification, Q&A, and spatial predictions by extracting hidden embeddings through alignment methods.
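
For readers unfamiliar with contrastive vision-language training, the sketch below shows a generic symmetric contrastive loss of the kind popularized by CLIP. It is not PE's training code: the temperature value, the function names, and the use of plain Python lists are assumptions for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _diagonal_nll(logits: Matrix) -> float:
    """Mean negative log-softmax of the diagonal (matching-pair) entries."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_emb: Matrix, text_emb: Matrix, temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_nll(logits) + _diagonal_nll(logits_t))
```

The loss is small when each image embedding is closest to its own caption's embedding and large when pairs are mismatched, which is the only supervision signal this style of pretraining uses.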

Authors:Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
Title: PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Abstract:
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models
中文: 本文提出完全开放可复现的感知语言模型框架,避免使用专有模型蒸馏,发布280万人工标注视频数据以弥补关键数据缺口,并建立PLM-VideoBench评估体系来推动透明化的视频理解研究。
English: This paper introduces a fully open and reproducible Perception Language Model framework that avoids proprietary model distillation, releases 2.8M human-labeled video annotations to address data gaps, and establishes PLM-VideoBench for comprehensive video understanding evaluation.

Authors:Fei Shen, Jian Yu, Cong Wang, Xin Jiang, Xiaoyu Du, Jinhui Tang
Title: IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design
Abstract:
This paper presents IMAGGarment, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. Code, models, and datasets are publicly available at https://github.com/muzishen/IMAGGarment.
中文摘要:IMAGGarment是一种细粒度服装生成框架,通过两阶段训练策略分别建模全局外观和局部细节,实现了对轮廓、颜色和徽标位置的高精度控制,在结构稳定性和局部可控性方面优于现有方法。
English Summary: IMAGGarment is a fine-grained garment generation framework that enables high-fidelity clothing synthesis with precise control over silhouette, color, and logo placement through a two-stage training approach, outperforming existing methods in structural stability and local controllability.

Authors:Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
Title: Sleep-time Compute: Beyond Inference Scaling at Test-time
Abstract:
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
中文: 睡眠时间计算让大语言模型在查询到来之前离线对上下文进行“思考”,在达到相同准确率的前提下将测试时计算需求降低约5倍,扩大睡眠时间计算后准确率最高提升18%,并通过在相关查询间分摊成本使单次查询平均开销降低2.5倍。
English: Sleep-time compute lets large language models reason offline about a context before queries arrive, reducing the test-time compute needed for the same accuracy by about 5x, improving accuracy by up to 18% when scaled, and cutting the average cost per query by 2.5x when amortized across related queries.
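The amortization argument can be illustrated with a small arithmetic sketch. The cost model below is an assumption made for illustration (a one-time offline cost shared by all queries on the same context, plus a per-query test-time cost); the function name and all numbers are hypothetical and are not taken from the paper.

```python
def average_cost_per_query(sleep_cost, test_cost_per_query, num_queries):
    """Average cost per query when one offline (sleep-time) pass is shared.

    Hypothetical cost model: the sleep-time cost is paid once per context
    and every query on that context pays its own test-time cost.
    """
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return sleep_cost / num_queries + test_cost_per_query


# Illustrative, made-up numbers: 2000 units of offline work per context,
# then 200 units per query at test time.
for n in (1, 5, 10):
    print(n, average_cost_per_query(2000, 200, n))
```

Under this model the offline cost only pays off once enough related queries share the same context, which is consistent with the abstract's observation that query predictability matters.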

Authors:Kaiyue Sun, Xian Liu, Yao Teng, Xihui Liu
Title: Personalized Text-to-Image Generation with Auto-Regressive Models
Abstract:
Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.
中文摘要:本文通过两阶段训练策略优化自回归模型进行个性化图像合成,实现了与主流扩散模型相当的主体保真度和提示跟随能力。
English Summary: This paper explores optimizing auto-regressive models for personalized image synthesis through a two-stage training approach, achieving performance comparable to leading diffusion-based methods.

Authors:Andrew Melnik, Benjamin Alt, Giang Nguyen, Artur Wilkowski, Maciej Stefańczyk, Qirui Wu, Sinan Harms, Helge Rhodin, Manolis Savva, Michael Beetz
Title: Digital Twin Generation from Visual Data: A Survey
Abstract:
This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics application, media content creation, or design and construction works. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: https://github.com/ndrwmlnk/awesome-digital-twins
中文: 本调查全面综述了从视频生成数字孪生的前沿方法,分析了多种技术及其应用,并探讨了关键挑战与未来研究方向。
English: This survey provides a comprehensive overview of state-of-the-art methods for creating digital twins from videos, analyzing various techniques and their applications while addressing key challenges and future research directions.

Authors:Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, Liujuan Cao
Title: Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs
Abstract:
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster. Our code will be available at https://github.com/Atrovast/THGS.
中文: 本文提出了一种免训练框架,通过从3D高斯基元构建超点图来实现高效且视图一致的开放词汇3D场景理解,取得了最先进的开放词汇分割性能,语义场重建速度提升30倍以上。
English: This paper introduces a training-free framework that constructs a superpoint graph from 3D Gaussian primitives to achieve efficient and view-consistent open-vocabulary 3D scene understanding, reaching state-of-the-art open-vocabulary segmentation performance while completing semantic field reconstruction over 30× faster.

Authors:Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong, Zhengzhong Tu, Yufan Liu, Xiangguang Chen, Zuowei Cao, Minhao Tang, Shan Liu, Kexin Zhang, Jingfen Xie, Yan Wang, Kai Chen, Shijie Zhao, Yunchen Zhang, Xiangkai Xu, Hong Gao, Ji Shi, Yiming Bao, Xiugang Dong, Xiangsheng Zhou, Yaofeng Tu, Ying Liang, Yiwen Wang, Xinning Chai, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song, Wei Sun, Kang Fu, Linhan Cao, Dandan Zhu, Kaiwei Zhang, Yucheng Zhu, Zicheng Zhang, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Zhi Jin, Jiawei Wu, Wei Wang, Wenjian Zhang, Yuhai Lan, Gaoxiong Yi, Hengyuan Na, Wang Luo, Di Wu, MingYin Bai, Jiawang Du, Zilong Lu, Zhenyu Jiang, Hui Zeng, Ziguan Cui, Zongliang Gan, Guijin Tang, Xinglin Xie, Kehuan Song, Xiaoqiang Lu, Licheng Jiao, Fang Liu, Xu Liu, Puhua Chen, Ha Thu Nguyen, Katrien De Moor, Seyed Ali Amirshahi, Mohamed-Chaker Larabi, Qi Tang, Linfeng He, Zhiyong Gao, Zixuan Gao, Guohua Zhang, Zhiye Huang, Yi Deng, Qingmiao Jiang, Lu Chen, Yi Yang, Xi Liao, Nourine Mohammed Nadir, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Meiqin Liu, Chao Yao, Yao Zhao
Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Abstract:
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.
中文: 本文综述了NTIRE 2025挑战赛,该赛事通过高效视频质量评估和基于扩散的超分辨率两个赛道,专注于提升快手、TikTok等平台的短格式用户生成内容视频质量与用户体验,共吸引266名参与者并收到18份有效提交。
English: This paper reviews the NTIRE 2025 Challenge focusing on short-form UGC video quality assessment and enhancement through two tracks—efficient VQA and diffusion-based super-resolution—aimed at improving user experience on platforms like Kwai and TikTok, attracting 266 participants and 18 valid submissions.

Authors:Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
Title: VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Abstract:
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
中文摘要:VistaDPO是一种新颖的视频层次时空直接偏好优化框架,通过实例、时序和感知三个层级优化文本-视频偏好对齐,并构建了包含7200对问答数据的新数据集,有效解决了大型视频模型中的视频-语言错位和幻觉问题,显著提升了多项基准测试性能。
English Summary: VistaDPO is a novel framework that addresses video-language misalignment and hallucination in Large Video Models by optimizing text-video preference alignment across instance, temporal, and perceptive levels, using a newly constructed 7.2K QA dataset to significantly enhance model performance on multiple benchmarks.
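For readers unfamiliar with the objective that VistaDPO builds on, the following is a minimal sketch of the standard DPO loss for a single chosen/rejected pair. It is not the hierarchical instance, temporal, and perceptive formulation described in the abstract; the function names, the `beta` default, and the example log-probabilities are assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities of the chosen and rejected
    responses under the trained policy and under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# The loss falls below log(2) once the policy favors the chosen response
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and reference agree exactly, the loss equals log 2, which is a convenient sanity check for an implementation.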

Authors:Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li
Title: EventVAD: Training-Free Event-Aware Video Anomaly Detection
Abstract:
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require an amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
中文摘要:EventVAD是一种事件感知的视频异常检测框架,通过结合动态图架构和多模态大语言模型进行时序事件推理,在无需训练的场景下实现了最先进的检测性能,能有效捕捉细粒度视觉转换和多样化事件。
English Summary: EventVAD is an event-aware video anomaly detection framework that integrates dynamic graph modeling and multimodal large language models to achieve state-of-the-art performance in training-free settings by capturing fine-grained events through temporal reasoning.

Authors:Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Title: Retrieval-Augmented Generation with Conflicting Evidence
Abstract:
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.
中文: 研究者提出RAMDocs数据集模拟包含歧义和错误信息的复杂检索场景,并开发MADAM-RAG多智能体辩论系统,能同时处理多种信息冲突,在基准测试中最高提升15.80%的准确率。
English: Researchers introduce RAMDocs, a dataset simulating complex retrieval scenarios with ambiguity and misinformation, and MADAM-RAG, a multi-agent debating system that jointly handles conflicting information while improving accuracy over baselines by up to 15.80%.

Authors:Prasanna Reddy Pulakurthi, Majid Rabbani, Celso M. de Melo, Sohail A. Dianat, Raghuveer M. Rao
Title: Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data
Abstract:
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques. The code is available at https://github.com/PrasannaPulakurthi/Foreground-Background-Augmentation
中文: 本文提出了一种双区域增强方法,通过对前景和背景区域进行针对性变换来提升模型鲁棒性和泛化能力,减少对大规模标注数据的依赖,在跨域适应和行人重识别任务中表现优异。
English: This paper presents a dual-region augmentation method that enhances model robustness and generalization for computer vision tasks by applying targeted transformations to foreground and background regions, reducing dependency on large labeled datasets.
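The two transformations named in the abstract (noise on the foreground, spatial shuffling of background patches) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the Gaussian noise model, the non-overlapping patch grid, the rule that a patch counts as background only if it contains no foreground pixel, and all parameter names are assumptions.

```python
import random


def dual_region_augment(image, fg_mask, patch=2, noise_std=0.1, seed=None):
    """Perturb foreground pixels and shuffle background patches.

    image:   list of rows of floats (single-channel, for brevity).
    fg_mask: same shape as image, True where a pixel is foreground.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input untouched

    # 1) Add Gaussian noise to foreground pixels only (assumed noise model).
    for y in range(h):
        for x in range(w):
            if fg_mask[y][x]:
                out[y][x] += rng.gauss(0.0, noise_std)

    # 2) Find grid patches that contain no foreground pixel.
    corners = [
        (y, x)
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
        if not any(fg_mask[y + dy][x + dx]
                   for dy in range(patch) for dx in range(patch))
    ]

    # 3) Copy those patches, permute them, and write them back.
    contents = [[out[y + dy][x:x + patch] for dy in range(patch)]
                for y, x in corners]
    rng.shuffle(contents)
    for (y, x), block in zip(corners, contents):
        for dy in range(patch):
            out[y + dy][x:x + patch] = block[dy]
    return out


# Toy 4x4 image whose top-left 2x2 block is the foreground object.
img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
mask = [[r < 2 and c < 2 for c in range(4)] for r in range(4)]
augmented = dual_region_augment(img, mask, patch=2, noise_std=0.05, seed=0)
```

In practice the foreground mask would come from a segmentation model or dataset annotations; the abstract does not say which, so that choice is left open here.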

Authors:Ruizhe Chen, Dongyu Xue, Xiangxin Zhou, Zaixiang Zheng, Xiangxiang Zeng, Quanquan Gu
Title: An All-Atom Generative Model for Designing Protein Complexes
Abstract:
Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Research on single-chain protein modeling has been extensively and deeply explored, with advancements seen in models like the series of ESM and AlphaFold2. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine-tuning (SFT) while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results. We released our code at https://github.com/bytedance/apm.
中文: 我们推出了APM模型,专为多链蛋白质建模设计,通过整合原子级数据精确模拟相互作用并设计功能性蛋白质复合物,在多项任务中取得了领先成果。
English: We introduce APM, a model designed for multi-chain protein modeling that integrates atom-level data to accurately model interactions and design functional protein complexes, achieving state-of-the-art results in various tasks.

Authors:Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Title: SkyReels-V2: Infinite-length Film Generative Model
Abstract:
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
中文: SkyReels-V2是一种无限时长影片生成模型,融合多模态大语言模型、多阶段预训练、强化学习和扩散强制框架,以解决视频生成在提示遵循、视觉质量、运动动态和时长之间难以兼顾的问题。
English: SkyReels-V2 is an infinite-length film generative model that combines a multi-modal LLM, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework to address the difficulty of balancing prompt adherence, visual quality, motion dynamics, and duration in video generation.

Authors:Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang
Title: EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance
Abstract:
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.
中文: EchoWorld是一种运动感知的世界建模框架,通过编码解剖知识和运动动态来提升超声心动图探头引导,利用历史视觉运动数据进行预训练和微调,显著减少了引导误差。
English: EchoWorld is a motion-aware world modeling framework that enhances echocardiography probe guidance by encoding anatomical knowledge and motion dynamics, significantly reducing errors through pre-training and fine-tuning with historical visual-motion data.

Authors:Linkang Du, Zheng Zhu, Min Chen, Zhou Su, Shouling Ji, Peng Cheng, Jiming Chen, Zhikun Zhang
Title: ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models
Abstract:
Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, leading to concerns about artworks' copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable. To this end, we propose a novel method for data-use auditing in the text-to-image generation model. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases by an online platform Scenario. ArtistAuditor is open-sourced at https://github.com/Jozenn/ArtistAuditor.
中文: ArtistAuditor是一种新颖的数据使用审计方法,通过分析多粒度风格表征来识别文本到图像模型是否使用了特定艺术家的作品进行微调,在六种模型与数据集组合上AUC均高于0.937,并在真实在线平台案例中验证了有效性。
English: ArtistAuditor is a novel data-use auditing method that identifies whether a text-to-image model has been fine-tuned on specific artists' works by analyzing multi-granularity style representations, achieving AUC values above 0.937 across six model-dataset combinations and proving effective in a real-world online platform case.

Authors:Dachun Kai, Yueyi Zhang, Jin Wang, Zeyu Xiao, Zhiwei Xiong, Xiaoyan Sun
Title: Event-Enhanced Blurry Video Super-Resolution
Abstract:
In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is +2.59 dB more accurate and 7.28$\times$ faster than the recent best BVSR baseline FMA-Net. Code: https://github.com/DachunKai/Ev-DeblurVSR.
中文: 本文提出Ev-DeblurVSR事件增强网络,通过融合事件信号与互惠特征去模糊模块及混合可变形对齐模块,在模糊视频超分辨率任务中实现了最优性能,在真实数据上比FMA-Net基线精度提升2.59 dB、速度快7.28倍。
English: This paper introduces Ev-DeblurVSR, an event-enhanced network that integrates event signals with reciprocal feature deblurring and hybrid deformable alignment to achieve state-of-the-art performance in blurry video super-resolution; on real-world data it is 2.59 dB more accurate and 7.28× faster than the FMA-Net baseline.

Authors:Yide Liu, Haijiang Sun, Xiaowen Zhang, Qiaoyuan Liu, Zhouchang Chen, Chongzhuo Xiao
Title: TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution
Abstract:
Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: https://github.com/LED-666/TTRD3.
中文: 提出的纹理传递残差去噪双重扩散模型(TTRD3)通过多尺度特征提取、稀疏纹理引导和双重扩散框架,解决了遥感图像超分辨率中的关键难题,在多项指标上优于现有最佳方法,LPIPS提升1.43%,FID改善3.67%。
English: The proposed Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) addresses key challenges in Remote Sensing Image Super-Resolution by incorporating multi-scale feature extraction, texture transfer guidance, and a dual diffusion framework, demonstrating superior performance with 1.43% LPIPS and 3.67% FID improvements over existing methods.

Authors:Guoqing Zhang, Jingyun Yang, Yang Li
Title: Hierarchical Feature Learning for Medical Point Clouds via State Space Model
Abstract:
Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformers and state space models (SSMs) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample the input into multiple levels through farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist the SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.
中文: 本文提出了一种基于状态空间模型的分层特征学习框架,通过多尺度结构信息聚合和创新的点云序列化策略,在新建的MedPointS医疗点云数据集上实现了分类、补全和分割任务的最优性能。
English: This paper introduces a hierarchical SSM-based framework for medical point cloud analysis, employing multi-scale feature aggregation and novel scanning strategies to achieve state-of-the-art performance on the newly created MedPointS dataset for classification, completion, and segmentation tasks.
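
The multi-level down-sampling step described in the abstract can be illustrated with a minimal sketch of farthest point sampling. This is not the authors' implementation: the function names `farthest_point_sampling` and `multi_level_downsample`, the fixed start index, and the toy coordinates are all assumptions made for illustration.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return k indices; each new index is the point farthest from all picks so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def multi_level_downsample(points, sizes):
    """Apply farthest point sampling repeatedly to build coarser and coarser levels."""
    levels = []
    current = list(points)
    for k in sizes:
        idx = farthest_point_sampling(current, k)
        current = [current[i] for i in idx]
        levels.append(current)
    return levels


# Hypothetical toy point cloud (not from MedPointS).
cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (10, 1, 0), (5, 5, 0)]
picked = farthest_point_sampling(cloud, 3)
levels = multi_level_downsample(cloud, [3, 2])
```

Each level keeps points that are spread out over the shape, which is why this sampling is a common choice before neighborhood queries such as KNN.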

Authors:Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
Title: Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation
Abstract:
Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance because of high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, that are essential for efficient low-precision computations. In this paper, we introduce Tilus, a domain-specific language designed for General-Purpose GPU (GPGPU) computing that supports low-precision data types with arbitrary bit widths from 1 to 8 while maintaining GPU programmability. Tilus features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. Tilus programs are compiled into highly efficient GPU programs through automatic vectorization and instruction selection. Extensive experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels. Compared to existing compilers such as Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, Tilus achieves performance improvements of: $1.75\times$, $2.61\times$, $1.29\times$ and $1.03\times$, respectively. We open-source Tilus at https://github.com/NVIDIA/tilus.
中文:Tilus是一种面向通用GPU计算的领域特定语言,支持1至8位任意位宽的低精度计算,通过优化GPU编程显著超越了现有解决方案的性能。
English: Tilus is a domain-specific language for GPGPU computing that enables efficient low-precision computation with arbitrary bit widths from 1 to 8, outperforming existing solutions through optimized GPU programming.
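
To see why bit widths that are not powers of two are awkward to handle, the following sketch packs unsigned integers of an arbitrary width from 1 to 8 bits into a byte string and unpacks them again. It is plain Python for illustration only and does not use or imitate the Tilus language or its layout system; the helper names `pack_uints` and `unpack_uints` and the little-endian bit order are assumptions.

```python
def pack_uints(values, bits):
    """Pack unsigned integers of width `bits` (1..8) densely, lowest bits first."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in 1..8")
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for v in values:
        if not 0 <= v < (1 << bits):
            raise ValueError("value does not fit in the given bit width")
        acc |= v << nbits
        nbits += bits
        while nbits >= 8:          # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # flush the final partial byte, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_uints(data, bits, count):
    """Inverse of pack_uints: recover `count` values of width `bits`."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


weights = [5, 1, 7, 0, 3, 6, 2, 4]      # eight 3-bit values
packed = pack_uints(weights, 3)          # 24 bits fit in 3 bytes
restored = unpack_uints(packed, 3, len(weights))
```

With 3-bit values, individual elements straddle byte boundaries, so a kernel cannot simply index into memory per element; this is the kind of detail that low-precision kernels have to manage explicitly.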

Authors:Al Arsh Basheer, Justin Chang, Yuyang Chen, David Kim, Iman Soltani
Title: Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation
Abstract:
This paper presents the Krysalis Hand, a five-finger robotic end-effector that combines a lightweight design, high payload capacity, and a high number of degrees of freedom (DoF) to enable dexterous manipulation in both industrial and research settings. This design integrates the actuators within the hand while maintaining an anthropomorphic form. Each finger joint features a self-locking mechanism that allows the hand to sustain large external forces without active motor engagement. This approach shifts the payload limitation from the motor strength to the mechanical strength of the hand, allowing the use of smaller, more cost-effective motors. With 18 DoF and weighing only 790 grams, the Krysalis Hand delivers an active squeezing force of 10 N per finger and supports a passive payload capacity exceeding 10 lbs. These characteristics make Krysalis Hand one of the lightest, strongest, and most dexterous robotic end-effectors of its kind. Experimental evaluations validate its ability to perform intricate manipulation tasks and handle heavy payloads, underscoring its potential for industrial applications as well as academic research. All code related to the Krysalis Hand, including control and teleoperation, is available on the project GitHub repository: https://github.com/Soltanilara/Krysalis_Hand
中文:Krysalis Hand是一款轻量级、18自由度的机器人末端执行器,集成了执行器和自锁关节,每指可产生10N主动抓握力并承载超过10磅被动负载,兼具高灵巧性与强负载能力,适用于工业及科研场景。
English: The Krysalis Hand is a lightweight, 18-DoF robotic end-effector with integrated actuators and self-locking joints, enabling high dexterity, a 10N active force per finger, and over 10 lbs passive payload capacity for versatile industrial and research use.

Authors:Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
Title: Disentangling Polysemantic Channels in Convolutional Neural Networks
Abstract:
Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
中文: 本文提出一种算法,将卷积神经网络中的多义通道分解为多个单一概念通道,通过基于不同激活模式重构权重来提高模型可解释性。
English: This paper introduces an algorithm to disentangle polysemantic channels in CNNs into distinct concept-specific channels, enhancing interpretability by restructuring weights based on unique activation patterns.

Authors:Ebrahim Norouzi, Sven Hertling, Harald Sack
Title: ConExion: Concept Extraction with Large Language Models
Abstract:
In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles a more challenging task of extracting all present concepts related to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at https://github.com/ISE-FIZKarlsruhe/concept_extraction.
中文: 本文提出了一种利用预训练大语言模型从文档中提取所有领域相关概念的方法,相比现有技术提升了F1值,并通过提示探索无监督提取以支持本体评估和学习。
English: This paper introduces a method using pre-trained large language models to extract all domain-related concepts from documents, showing improved F1 scores over existing techniques and exploring unsupervised extraction via prompts to aid ontology evaluation and learning.

Authors:Youyi Zhan, Tianjia Shao, Yin Yang, Kun Zhou
Title: Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
Abstract:
Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with details while being rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth changes in Gaussian properties caused by interpolation, we define a Gaussian offset basis for each Gaussian, and a linear combination of the basis represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the basis. In this way, although Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with the freely learned basis can still produce distinctly different Gaussian property offsets, making it possible to learn high-frequency spatial signals. We further use control points to constrain the Gaussians to be distributed on a surface layer rather than allowing them to be irregularly distributed inside the body, to help the human avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while the rendering speed is significantly faster under novel views and novel poses.
中文: 本文提出了一种新颖的高斯人体化身表示方法,通过分布在人体不同位置的多层感知机重建高保真的姿态相关外观细节,同时实现实时渲染,在视觉质量和速度上均优于现有方法。
English: This paper introduces a novel Gaussian human avatar representation that uses spatially distributed MLPs on the body to reconstruct high-fidelity pose-dependent appearance details while enabling real-time rendering, outperforming existing methods in both visual quality and speed.
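
The coefficient-and-basis idea in the abstract can be sketched in a few lines. The abstract only says that interpolation is based on distances; the inverse-distance weighting used here is an assumption, as are the function names `interpolate_coefficients` and `property_offset` and all numeric values.

```python
import math


def interpolate_coefficients(gaussian_pos, mlp_positions, mlp_outputs, eps=1e-8):
    """Blend the coefficient vectors of nearby MLPs with inverse-distance weights.

    Inverse-distance weighting is an assumed scheme, chosen only for illustration.
    """
    weights = [1.0 / (math.dist(gaussian_pos, p) + eps) for p in mlp_positions]
    total = sum(weights)
    n = len(mlp_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, mlp_outputs)) / total
        for j in range(n)
    ]


def property_offset(coeffs, basis):
    """Linear combination of per-Gaussian basis vectors gives the property offset."""
    dim = len(basis[0])
    return [sum(c * b[d] for c, b in zip(coeffs, basis)) for d in range(dim)]


# Hypothetical setup: one Gaussian midway between two MLP locations.
coeffs = interpolate_coefficients(
    (0.0, 0.0, 0.0),
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [[1.0, 0.0], [0.0, 1.0]],
)
# The basis is per-Gaussian and unconstrained, so neighbors can differ sharply.
offset = property_offset(coeffs, [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
```

The coefficients vary smoothly with position because they are interpolated, while the basis differs freely from one Gaussian to the next, which is how neighboring Gaussians can still end up with very different offsets.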

Authors:Mingzhe Yu, Yunshan Ma, Lei Wu, Changshuo Wang, Xue Li, Lei Meng
Title: FashionDPO: Fine-tune Fashion Outfit Generation Model using Direct Preference Optimization
Abstract:
Personalized outfit generation aims to construct a set of compatible and personalized fashion items as an outfit. Recently, generative AI models have received widespread attention, as they can generate fashion items for users to complete an incomplete outfit or create a complete outfit. However, they have limitations in terms of lacking diversity and relying on the supervised learning paradigm. Recognizing this gap, we propose a novel framework FashionDPO, which fine-tunes the fashion outfit generation model using direct preference optimization. This framework aims to provide a general fine-tuning approach to fashion generative models, refining a pre-trained fashion outfit generation model using automatically generated feedback, without the need to design a task-specific reward function. To make sure that the feedback is comprehensive and objective, we design a multi-expert feedback generation module which covers three evaluation perspectives, i.e., quality, compatibility and personalization. Experiments on two established datasets, i.e., iFashion and Polyvore-U, demonstrate the effectiveness of our framework in enhancing the model's ability to align with users' personalized preferences while adhering to fashion compatibility principles. Our code and model checkpoints are available at https://github.com/Yzcreator/FashionDPO.
中文: 提出的FashionDPO框架采用直接偏好优化方法微调时尚搭配生成模型,无需特定任务奖励函数即可提升生成搭配的多样性、兼容性和个性化程度,在iFashion和Polyvore-U数据集上的实验验证了其有效性。
English: The proposed FashionDPO framework introduces a direct preference optimization method to fine-tune fashion outfit generation models, enhancing their ability to produce diverse, compatible, and personalized outfits without task-specific reward functions, as validated on iFashion and Polyvore-U datasets.

Authors:Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Title: EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Abstract:
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
中文:EmoVoice是一种新颖的情感可控语音合成模型,利用大语言模型实现细粒度的自然语言情感控制,并通过音素增强设计提高内容一致性,在英文和中文测试集上均取得了最先进的性能表现。
English: EmoVoice is an emotion-controllable TTS model that utilizes large language models for fine-grained natural language emotion control and a phoneme boost design to enhance content consistency, achieving state-of-the-art performance on both English and Chinese test sets.

Authors:Pengxuan Yang, Yupeng Zheng, Qichao Zhang, Kefei Zhu, Zebin Xing, Qiao Lin, Yun-Fu Liu, Zhiguo Su, Dongbin Zhao
Title: UncAD: Towards Safe End-to-end Autonomous Driving via Online Map Uncertainty
Abstract:
End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.
中文: 提出的UncAD范式通过估计并利用在线地图的不确定性来指导轨迹预测与选择,以微小参数代价显著降低碰撞与冲突率,从而提升自动驾驶安全性。
English: The proposed UncAD paradigm enhances autonomous driving safety by estimating and utilizing online map uncertainty to guide trajectory prediction and selection, significantly reducing collision and conflict rates with minimal parameter increase.

Authors:Xue Wen Tan, Stanley Kok
Title: SMARTe: Slot-based Method for Accountable Relational Triple extraction
Abstract:
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research. Our code is available at https://github.com/Chen-XueWen/SMARTe.
中文: SMARTe是一种基于槽注意力的可解释关系三元组抽取方法,通过将信息整合至可追溯的槽表示中,在保持与先进模型相当性能的同时实现了内在可解释性。
English: SMARTe is an interpretable relational triple extraction method that uses slot attention to consolidate information into traceable representations while maintaining performance comparable to state-of-the-art models.

Authors:Inzamamul Alam, Md Tanvir Islam, Simon S. Woo
Title: Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal
Abstract:
As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at https://github.com/inzamamulDU/SADRE.
中文: 本文提出的SADRE框架通过显著性引导的噪声注入和扩散重建技术,在保持图像质量的同时有效去除水印,其性能优于现有方法。
English: This paper proposes the SADRE framework, which effectively removes watermarks through saliency-guided noise injection and diffusion reconstruction while maintaining image quality, outperforming existing methods.

Authors:Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong
Title: Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
Abstract:
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
中文: 本文提出ANT微调框架,通过自动引导去噪轨迹避免生成有害内容,在单概念与多概念消除任务中均达到最优性能,且不损害图像生成质量。
English: The paper introduces ANT, a finetuning framework that automatically guides denoising trajectories to prevent harmful content generation in text-to-image models, achieving state-of-the-art performance in both single and multi-concept erasure without compromising image quality.

Authors:Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Title: Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that automatically generates step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Codes are available at https://github.com/ycpNotFound/GeoGen.
中文: 本文提出GeoGen自动生成几何问题的逐步推理数据,并训练GeoLogic模型将符号系统融入多模态大语言模型,以增强几何推理能力、减少幻觉现象并显著提升任务表现。
English: This paper introduces GeoGen, a pipeline that generates step-by-step reasoning data for geometry problems, and GeoLogic, a model trained on this data to enhance multimodal large language models' geometric reasoning by integrating symbolic systems, reducing hallucinations and improving performance.

Authors:Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu
Title: GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks
Abstract:
This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.
中文: GraphOmni作为评估大语言模型图推理能力的综合基准,揭示了影响性能的关键因素,并提出自适应框架以提升模型表现。
English: GraphOmni is a comprehensive benchmark for evaluating LLMs' graph reasoning abilities, revealing key performance factors and proposing an adaptive framework to enhance model capabilities.

Authors:Siyu Chen, Ting Han, Changshe Zhang, Xin Luo, Meiliu Wu, Guorong Cai, Jinhe Su
Title: Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
Abstract:
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
中文: DepthForge通过整合深度信息增强视觉基础模型的几何一致性,在语义分割中显著提升了泛化能力,尤其在极端环境下表现卓越。
English: DepthForge enhances Vision Foundation Models' generalization in semantic segmentation by integrating depth information to improve geometric consistency, achieving superior performance across diverse conditions.

Authors:Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa
Title: Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Abstract:
Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, a gap between image and text feature clusters that limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs with large datasets or degradations of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without zero-shot performance degradations. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model retain past knowledge while learning new knowledge to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
中文: CLIP-Refine是一种后预训练方法,通过随机特征对齐和混合对比蒸馏技术,有效缩小CLIP模型中的模态差距,在提升零样本性能的同时避免性能下降。
English: CLIP-Refine is a post-pre-training method that reduces the modality gap in CLIP models by aligning image and text features through random feature alignment and hybrid contrastive-distillation, improving zero-shot performance without degradation.
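
The hybrid soft labels mentioned for HyCD can be illustrated with a small sketch. The abstract states only that ground-truth pair labels are combined with outputs of the pre-trained model; the convex mix with a weight `alpha`, the function names, and the example logits below are assumptions and may differ from the paper's actual formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_soft_labels(teacher_logits, true_index, alpha=0.5):
    """Mix a one-hot ground-truth label with the frozen teacher's distribution.

    alpha is an assumed mixing weight: 1.0 gives pure one-hot, 0.0 pure teacher.
    """
    teacher = softmax(teacher_logits)
    return [
        alpha * (1.0 if j == true_index else 0.0) + (1.0 - alpha) * t
        for j, t in enumerate(teacher)
    ]


def soft_cross_entropy(student_logits, targets):
    """Cross-entropy between soft targets and the student's softmax output."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


# Hypothetical similarities of one image to three captions; caption 0 is the true pair.
targets = hybrid_soft_labels([2.0, 1.0, 0.0], true_index=0, alpha=0.5)
loss = soft_cross_entropy([1.5, 0.5, 0.2], targets)
one_hot = hybrid_soft_labels([2.0, 1.0, 0.0], true_index=0, alpha=1.0)
```

The teacher term keeps the student close to what the pre-trained model already believes, while the one-hot term pulls matched pairs together; the balance between the two is what the assumed `alpha` controls in this sketch.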

Authors:Qianqian Sun, Jixiang Luo, Dell Zhang, Xuelong Li
Title: SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Abstract:
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
中文: SmartFreeEdit提出了一种创新的端到端框架,将多模态大语言模型与超图增强修复架构相结合,通过自然语言指令实现精确的无掩码图像编辑,有效解决了空间推理和语义一致性等关键难题。
English: SmartFreeEdit introduces an innovative end-to-end framework that integrates a multimodal large language model with hypergraph-enhanced inpainting, enabling precise, mask-free image editing guided by natural language instructions while overcoming challenges in spatial reasoning and semantic consistency.

Authors:Naibang Wang, Deyong Shang, Yan Gong, Xiaoxi Hu, Ziying Song, Lei Yang, Yuhan Huang, Xiaoyu Wang, Jianli Lu
Title: Collaborative Perception Datasets for Autonomous Driving: A Review
Abstract:
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative-Perception-Datasets-for-Autonomous-Driving.
中文摘要:本文作为首个专注于协同感知数据集的系统性综述,从多维度比较现有资源,分析合作范式与传感器配置,并指出数据集可扩展性、标准化等关键挑战与未来方向。
English Summary: This comprehensive review analyzes collaborative perception datasets for autonomous driving, comparing them across cooperation paradigms, sensor configurations, and application scenarios while identifying key challenges like dataset scalability and standardization.

Authors:Qishan Wang, Shuyong Gao, Junjie Hu, Jiawen Yu, Xuan Tong, You Li, Wenqiang Zhang
Title: HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset
Abstract:
Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at https://github.com/Qiqigeww/HSS-IAD-Dataset.
中文: HSS-IAD数据集通过提供具有多样化结构和外观的金属部件及真实缺陷标注,弥补了现有工业异常检测数据集的不足,为多类别无监督异常检测方法在真实工厂环境下的评估提供了更有效的基准。
English: The HSS-IAD dataset addresses limitations in current industrial anomaly detection datasets by providing diverse metallic parts with realistic defects, enabling more accurate evaluation of multi-class unsupervised anomaly detection methods under real-world conditions.

Authors:Pengtao Dang, Tingbo Guo, Melissa Fishel, Guang Lin, Wenzhuo Wu, Sha Cao, Chi Zhang
Title: Physics Informed Constrained Learning of Dynamics from Static Data
Abstract:
A physics-informed neural network (PINN) models the dynamics of a system by integrating the governing physical laws into the architecture of a neural network. By enforcing physical laws as constraints, PINN overcomes challenges with data scarcity and potentially high dimensionality. Existing PINN frameworks rely on fully observed time-course data, the acquisition of which could be prohibitive for many systems. In this study, we developed a new PINN learning paradigm, namely Constrained Learning, that enables the approximation of first-order derivatives or motions using non-time course or partially observed data. Computational principles and a general mathematical formulation of Constrained Learning were developed. We further introduced MPOCtrL (Message Passing Optimization-based Constrained Learning), an optimization approach tailored for the Constrained Learning framework that strives to balance the fitting of physical models and observed data. The code is available at https://github.com/ptdang1001/MPOCtrL. Experiments on synthetic and real-world data demonstrated that MPOCtrL can effectively detect the nonlinear dependency between observed data and the underlying physical properties of the system. In particular, on the task of metabolic flux analysis, MPOCtrL outperforms all existing data-driven flux estimators.
中文: 本研究提出了约束学习这一新的物理信息神经网络范式,能够利用非时间序列数据近似系统动力学,并开发了MPOCtrL优化方法,在保持物理模型与观测数据平衡方面表现优异,在代谢通量分析等任务中超越了现有方法。
English: This study introduces Constrained Learning, a new physics-informed neural network paradigm that uses non-time course data to approximate system dynamics, along with an optimization method called MPOCtrL that effectively balances physical models with observed data, outperforming existing approaches in tasks like metabolic flux analysis.

Authors:Lvmin Zhang, Maneesh Agrawala
Title: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Abstract:
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
中文: FramePack是一种神经网络结构,通过压缩输入帧以保持固定的变换器上下文长度,从而高效训练下一帧视频预测模型,实现与图像扩散相当的计算瓶颈和更高批处理量,并采用防漂移采样方法减少误差累积。
English: FramePack is a neural network structure that enables efficient training of next-frame video prediction models by compressing input frames to maintain a fixed transformer context length, allowing for higher batch sizes and reduced computational bottlenecks similar to image diffusion, while incorporating an anti-drifting sampling method to minimize error accumulation.

Authors:John Chiang
Title: Privacy-Preserving CNN Training with Transfer Learning: Two Hidden Layers
Abstract:
In this paper, we present the demonstration of training a four-layer neural network entirely using fully homomorphic encryption (FHE), supporting both single-output and multi-output classification tasks in a non-interactive setting. A key contribution of our work is identifying that replacing \textit{Softmax} with \textit{Sigmoid}, in conjunction with the Binary Cross-Entropy (BCE) loss function, provides an effective and scalable solution for homomorphic classification. Moreover, we show that the BCE loss function, originally designed for multi-output tasks, naturally extends to the multi-class setting, thereby enabling broader applicability. We also highlight the limitations of prior loss functions such as the SLE loss and the one proposed in the 2019 CVPR Workshop, both of which suffer from vanishing gradients as network depth increases. To address the challenges posed by large-scale encrypted data, we further introduce an improved version of the previously proposed data encoding scheme, \textit{Double Volley Revolver}, which achieves a better trade-off between computational and memory efficiency, making FHE-based neural network training more practical. The complete, runnable C++ code to implement our work can be found at: https://github.com/petitioner/ML.NNtraining.
中文: 本文展示了使用全同态加密训练四层神经网络,通过Sigmoid函数与二元交叉熵损失的组合实现了有效的同态分类,并改进了数据编码方案以提升计算与存储效率。
English: This paper demonstrates training a four-layer neural network with fully homomorphic encryption, using Sigmoid with Binary Cross-Entropy loss as an effective solution for homomorphic classification and introducing an improved data encoding scheme for better efficiency.
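
The Sigmoid-plus-BCE combination described in the abstract above can be illustrated in plaintext. The sketch below is not the paper's encrypted implementation: it involves no homomorphic encryption, and the function names `sigmoid` and `bce_loss_and_grad` are hypothetical. It only shows why the pairing is convenient: with one sigmoid per class and a one-hot target, the gradient with respect to each logit reduces to a difference of the form `probability - target`.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss_and_grad(logits, one_hot):
    """Mean binary cross-entropy over per-class sigmoids, plus its gradient.

    Plaintext illustration only; an FHE version would need encrypted
    arithmetic and an approximation of the sigmoid, neither shown here.
    """
    eps = 1e-12  # guards log(0); an assumption of this sketch
    probs = [sigmoid(z) for z in logits]
    n = len(logits)
    loss = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, one_hot)
    ) / n
    # d(loss)/d(logit_k) = (sigmoid(logit_k) - y_k) / n
    grad = [(p - y) / n for p, y in zip(probs, one_hot)]
    return loss, grad

loss, grad = bce_loss_and_grad([0.0, 0.0], [1, 0])
```

Because the gradient needs no normalisation across classes (unlike Softmax), each class can be handled independently, which is one plausible reading of why the abstract calls the combination scalable.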

Authors:Yun-Cheng Li, Sen Lei, Yi-Tao Zhao, Heng-Chao Li, Jun Li, Antonio Plaza
Title: SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping
Abstract:
Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noises. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at https://github.com/SUPERMAN123000/FAEWNet.
中文摘要:提出的FAEWNet模型通过结合SAM编码器、分布感知傅里叶适配器和边缘约束变形模块,有效解决了建筑变化检测中的领域差异和时序错位问题,在多个数据集上取得了最优性能。
English Summary: The proposed FAEWNet model overcomes limitations in building change detection by integrating a SAM encoder with a distribution-aware Fourier adapter and an edge-constrained warping module, achieving state-of-the-art performance on multiple datasets.

Authors:Kewen Peng, Hao Zhuo, Yicheng Yang, Tim Menzies
Title: Software Engineering Principles for Fairer Systems: Experiments with GroupCART
Abstract:
Discrimination-aware classification aims to make accurate predictions while satisfying fairness constraints. Traditional decision tree learners typically optimize for information gain in the target attribute alone, which can result in models that unfairly discriminate against protected social groups (e.g., gender, ethnicity). Motivated by these shortcomings, we propose GroupCART, a tree-based ensemble optimizer that avoids bias during model construction by optimizing not only for decreased entropy in the target attribute but also for increased entropy in protected attributes. Our experiments show that GroupCART achieves fairer models without data transformation and with minimal performance degradation. Furthermore, the method supports customizable weighting, offering a smooth and flexible trade-off between predictive performance and fairness based on user requirements. These results demonstrate that algorithmic bias in decision tree models can be mitigated through multi-task, fairness-aware learning. All code and datasets used in this study are available at: https://github.com/anonymous12138/groupCART.
中文:GroupCART是一种基于树集成的公平性优化方法,通过在模型构建中同时优化目标属性和保护属性的熵,无需数据转换即可实现公平分类,且支持根据需求灵活调整预测性能与公平性的平衡。
English: GroupCART is a fairness-aware ensemble method that optimizes for both target prediction accuracy and fairness by balancing entropy in target and protected attributes, achieving equitable models with minimal performance loss and customizable trade-offs.
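
The split criterion described in the GroupCART abstract (reward lower entropy in the target, reward higher entropy in the protected attribute) can be sketched as a weighted difference of two information gains. The exact formula, the `weight` parameter, and all function names below are assumptions made for illustration; the abstract does not specify how the two terms are combined.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, left_idx, right_idx):
    """Entropy reduction in `values` produced by a binary split."""
    n = len(values)
    left = [values[i] for i in left_idx]
    right = [values[i] for i in right_idx]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(values) - children

def fair_split_score(target, protected, left_idx, right_idx, weight=0.5):
    """Hypothetical fairness-aware score: high when the split separates the
    target classes but does NOT separate the protected groups."""
    gain_target = information_gain(target, left_idx, right_idx)
    gain_protected = information_gain(protected, left_idx, right_idx)
    return weight * gain_target - (1 - weight) * gain_protected

target = [0, 0, 1, 1]
protected = ["a", "b", "a", "b"]
good = fair_split_score(target, protected, [0, 1], [2, 3])  # splits on target only
bad = fair_split_score(target, protected, [0, 2], [1, 3])   # splits on protected only
```

Moving `weight` toward 1 recovers an ordinary information-gain criterion, while moving it toward 0 penalises splits that reveal the protected attribute, which mirrors the customizable trade-off the abstract mentions.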

Authors:Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu
Title: CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework
Abstract:
Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for the RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrated the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.
中文:CM3AE框架提出了一种新颖的RGB-事件感知预训练方法,通过融合多模态数据并采用融合重建与对比学习策略,有效提升跨模态理解能力,在多种下游任务中展现出卓越性能。
English: The CM3AE framework introduces a novel pre-training approach for RGB-Event perception by integrating multi-modal data and employing fusion reconstruction and contrastive learning to enhance cross-modal understanding and performance across diverse downstream tasks.

Authors:Haidar Khan, Hisham A. Alyahya, Yazeed Alnumay, M Saiful Bari, Bülent Yener
Title: ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
Abstract:
Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations - methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.
中文: ZeroSumEval提出了一种基于零和博弈的竞争性评估协议,通过动态基准测试大语言模型,发现尽管它们在常规任务中表现良好,但在创造性和新颖问题解决方面存在明显不足。
English: ZeroSumEval introduces a competition-based evaluation protocol using zero-sum games to dynamically assess Large Language Models, revealing their limitations in creativity and novel problem-solving despite proficiency in common tasks.

Authors:Negar Arabzadeh, Charles L. A. Clarke
Title: Benchmarking LLM-based Relevance Judgment Methods
Abstract:
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods (document-agnostic and document-dependent). In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks (2019, 2020, and 2021) as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to \textit{reproduce} various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
中文: 本文系统比较了多种基于大语言模型的相关性评估方法,包括二元判断和成对偏好等,通过多个数据集验证其与人工评估的一致性,并提供全面的对比分析。
English: This paper systematically compares multiple LLM-based relevance assessment methods, including binary judgments and pairwise preferences, across multiple datasets to evaluate their alignment with human judgments and provide comprehensive comparative insights.
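
The abstract above compares system rankings using Kendall correlations. As a minimal sketch of that kind of comparison, the function below computes Kendall's tau-a between two sets of per-system scores, for example one derived from human judgments and one from LLM judgments. The function name, the dictionary input format, and the example scores are hypothetical, and the paper may use a different tau variant or tie handling.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system_name: score} mappings.

    Counts system pairs ordered the same way (concordant) versus the
    opposite way (discordant); tied pairs count as neither.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative numbers only, not results from the paper.
human = {"s1": 0.50, "s2": 0.40, "s3": 0.30}
llm = {"s1": 0.40, "s2": 0.50, "s3": 0.30}
tau = kendall_tau(human, llm)  # one swapped pair out of three
```

A value near 1 means the two judgment sources rank the systems almost identically, even if the absolute scores differ.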

Authors:Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, Amelia Glaese
Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Abstract:
We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
中文: BrowseComp是一个简洁而具有挑战性的基准测试,通过1,266个可验证的简答题来评估网络浏览代理持续查找复杂关联信息的能力。
English: BrowseComp is a straightforward yet demanding benchmark designed to assess web browsing agents' ability to persistently locate complex, interconnected information through 1,266 verifiable short-answer questions.

Authors:Kaustav Chanda, Aayush Atul Verma, Arpitsinh Vaghela, Yezhou Yang, Bharatesh Chakravarthi
Title: Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space
Abstract:
Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at https://github.com/eventbasedvision/EQS.
中文: 事件相机具有低延迟和高动态范围等优势,但其在深度学习中的应用受限于高质量标注数据的稀缺以及模拟其独特传感器设计的挑战,为此引入事件质量评分(EQS)指标来提升模拟器的真实性并缩小仿真差距。
English: Event cameras offer advantages like low latency and high dynamic range, but their adoption in deep learning is limited by the scarcity of labeled data and the challenge of accurately simulating their unique sensor design, leading to the introduction of the event quality score (EQS) metric to improve simulator realism and reduce the simulation gap.

Authors:Kaira M. Samuel, Faez Ahmed
Title: Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study
Abstract:
Engineering problems that apply machine learning often involve computationally intensive methods but rely on limited datasets. As engineering data evolves with new designs and constraints, models must incorporate new knowledge over time. However, high computational costs make retraining models from scratch infeasible. Continual learning (CL) offers a promising solution by enabling models to learn from sequential data while mitigating catastrophic forgetting, where a model forgets previously learned mappings. This work introduces CL to engineering design by benchmarking several CL methods on representative regression tasks. We apply these strategies to five engineering datasets and construct nine new engineering CL benchmarks to evaluate their ability to address forgetting and improve generalization. Preliminary results show that applying existing CL methods to these tasks improves performance over naive baselines. In particular, the Replay strategy achieved performance comparable to retraining in several benchmarks while reducing training time by nearly half, demonstrating its potential for real-world engineering workflows. The code and datasets used in this work will be available at: https://github.com/kmsamuel/cl-for-engineering-release.
中文: 本研究将持续学习引入工程设计,通过在回归任务上对多种方法进行基准测试,结果表明Replay策略在减少近一半训练时间的同时,在若干基准上实现了与完全重新训练相当的性能,相关代码和数据集将公开发布。
English: This study introduces continual learning to engineering design by benchmarking various methods on regression tasks, showing that the Replay strategy achieves performance comparable to full retraining on several benchmarks while cutting training time by nearly half; the code and datasets are to be released publicly.

Authors:Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
Title: Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Abstract:
Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
中文: 默认混合专家模型通过使用默认输出为路由器提供密集梯度更新,从而显著提升训练稳定性和性能,且无需额外计算开销。
English: Default MoE enhances training stability and performance by providing dense gradient updates to the router through default outputs, without adding significant computational cost.
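
The substitution idea in the Default MoE abstract can be sketched with a toy, scalar-output layer. Everything below is an illustrative assumption rather than the authors' implementation: the class name `DefaultMoESketch`, the `ema_decay` value, the zero initialisation of the defaults, and the choice to weight all outputs by the full softmax are hypothetical. The point it demonstrates is that inactive experts contribute a cached exponential moving average instead of being skipped, so every router probability appears in the output.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class DefaultMoESketch:
    """Toy MoE layer whose inactive experts fall back to an EMA default."""

    def __init__(self, experts, top_k=1, ema_decay=0.9):
        self.experts = experts            # list of callables: float -> float
        self.top_k = top_k
        self.ema_decay = ema_decay        # hypothetical hyperparameter
        self.defaults = [0.0] * len(experts)

    def forward(self, x, router_logits):
        probs = softmax(router_logits)
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        active = order[: self.top_k]
        outputs = []
        for i, expert in enumerate(self.experts):
            if i in active:
                y = expert(x)  # only active experts are actually evaluated
                self.defaults[i] = (
                    self.ema_decay * self.defaults[i] + (1 - self.ema_decay) * y
                )
                outputs.append(y)
            else:
                outputs.append(self.defaults[i])  # stand-in for the skipped expert
        # Every expert's probability appears here, so a router trained through
        # this expression would receive a signal for all experts.
        return sum(p * y for p, y in zip(probs, outputs)), active

layer = DefaultMoESketch([lambda x: 2 * x, lambda x: -x], top_k=1, ema_decay=0.5)
out1, active1 = layer.forward(1.0, [2.0, 0.0])
out2, active2 = layer.forward(1.0, [0.0, 2.0])
```

On the second call, expert 0 is inactive but still contributes its cached average from the first call, while only one expert is evaluated per token.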

Authors:Minmin Yang, Huantao Ren, Senem Velipasalar
Title: 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap
Abstract:
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at https://github.com/LexieYang/3D-PointZshotS.
中文:提出的3D-PointZshotS框架通过引入潜在几何原型和自一致性损失来弥合语义与视觉间的鸿沟,显著提升了三维点云零样本分割性能,在多个数据集上表现优异。
English: The proposed 3D-PointZshotS framework enhances zero-shot 3D point cloud segmentation by incorporating latent geometric prototypes and a self-consistency loss to bridge the semantic-visual gap, achieving superior performance on multiple datasets.
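
The abstract above reports results in harmonic mIoU. Assuming the usual generalized zero-shot convention, in which the metric is the harmonic mean of the mIoU on seen classes and the mIoU on unseen classes, it can be computed as below. The function name and the example values are hypothetical and are not results from the paper.

```python
def harmonic_miou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (both in [0, 1])."""
    total = miou_seen + miou_unseen
    if total == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / total

# Illustrative numbers only.
score = harmonic_miou(0.6, 0.3)
```

The harmonic mean is dominated by the smaller of the two values, so a model cannot score well by performing strongly on seen classes alone.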

Authors:Negar Arabzadeh, Charles L. A. Clarke
Title: A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment
Abstract:
Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks (binary, graded, and pairwise), yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning Datasets (2020 and 2021). We compare LLM-generated labels with TREC official human labels using Cohen's $κ$ and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among different LLMs as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and the TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at https://github.com/Narabzad/prompt-sensitivity-relevance-judgements/.
中文摘要:大型语言模型在信息检索中越来越多地用于自动化相关性判断,其与人类标注的一致性接近人类间一致性,本研究系统评估了提示敏感性在不同任务和数据集上对模型鲁棒性和可靠性的影响。
English Summary: Large Language Models (LLMs) are increasingly used for automated relevance judgments in information retrieval, showing agreement with human labels that nears inter-human agreement, while this study systematically evaluates the impact of prompt sensitivity on their robustness and reliability across various tasks and datasets.
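
The abstract above measures agreement between LLM-generated and human labels with Cohen's κ. A minimal sketch of that statistic for two label lists follows; the function name and the example labels are hypothetical and do not come from the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Illustrative binary relevance labels only.
human = [1, 1, 0, 0]
llm = [1, 0, 0, 0]
kappa = cohens_kappa(human, llm)
```

Here the raters agree on three of four items (0.75), chance agreement is 0.5, and κ is therefore 0.5, which shows how raw agreement can overstate reliability.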

Authors:Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox
Title: Activated LoRA: Fine-tuned LLMs for Intrinsics
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence \emph{after} the aLoRA is invoked. This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the cache. This enables building what we call \emph{intrinsics}, i.e. specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while achieving significant inference benefits. The codebase is at https://github.com/IBM/activated-lora.
中文: 激活式低秩适应(aLoRA)改进了LoRA框架,仅对调用后的序列令牌进行权重调整,无需重新计算先前的键值缓存即可即时激活,从而显著提升了专用模型操作的推理效率。
English: Activated LoRA (aLoRA) enhances the LoRA framework by adapting weights only for tokens after its invocation, enabling instant activation without recomputing the prior KV cache and improving inference efficiency for specialized model operations.
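
The core mechanism in the aLoRA abstract, applying the low-rank update only to tokens at or after the invocation point, can be sketched on a single projection. This is a toy illustration with plain Python lists, not the authors' code or the PEFT API; the function names, the `invoke_at` parameter, and the matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def alora_project(tokens, base_w, lora_a, lora_b, invoke_at):
    """Project each token; add the low-rank delta B(Ax) only from `invoke_at` on.

    Tokens before `invoke_at` get exactly the base-model projection, so any
    keys and values cached for them by the base model remain valid.
    """
    outputs = []
    for pos, x in enumerate(tokens):
        y = matvec(base_w, x)
        if pos >= invoke_at:
            delta = matvec(lora_b, matvec(lora_a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs

base_w = [[1.0, 0.0], [0.0, 1.0]]   # identity, for readability
lora_a = [[1.0, 1.0]]               # rank-1 down-projection (1 x 2)
lora_b = [[1.0], [0.0]]             # rank-1 up-projection (2 x 1)
tokens = [[1.0, 2.0], [3.0, 4.0]]
adapted = alora_project(tokens, base_w, lora_a, lora_b, invoke_at=1)
```

A standard LoRA would correspond to `invoke_at=0`, which changes the projection of every earlier token and is why, as the abstract notes, the cache of the turn history must then be recomputed.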

Authors:Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox
Title: Activated LoRA: Fine-tuned LLMs for Intrinsics
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture which modifies the LoRA framework to only adapt weights for the tokens in the sequence after the aLoRA is invoked. This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the prior keys and values. This enables building what we call intrinsics, i.e. specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while significantly improving inference efficiency. We contributed our Activated LoRA implementation to the Huggingface PEFT library https://github.com/huggingface/peft.
中文: 激活式低秩适应(aLoRA)改进了LoRA框架,仅对调用后的序列令牌进行权重调整,无需重新计算先前的键值缓存即可即时激活,从而显著提升了专用模型操作的推理效率。
English: Activated LoRA (aLoRA) enhances the LoRA framework by adapting weights only for tokens after its invocation, enabling instant activation without recomputing the prior KV cache and improving inference efficiency for specialized model operations.
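
The core idea in the abstract above, adapting weights only for tokens after the adapter is invoked, can be illustrated with a toy projection. This is a minimal sketch, not the authors' implementation: the function names, the pure-Python matrices, and the choice to treat the projection as producing "keys" are all assumptions made for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def project_tokens(tokens, W, A, B, invoke_at=None):
    """Project each token vector with W, adding the low-rank update B (A x)
    only at positions >= invoke_at (None means base model only)."""
    out = []
    for pos, x in enumerate(tokens):
        y = matvec(W, x)
        if invoke_at is not None and pos >= invoke_at:
            delta = matvec(B, matvec(A, x))  # rank-r update
            y = [a + b for a, b in zip(y, delta)]
        out.append(y)
    return out


# Toy sizes (hidden dim 2, rank 1); all values are made up.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # r x d
B = [[0.5], [0.0]]    # d x r
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

base_keys = project_tokens(tokens, W, A, B)                 # base model
alora_keys = project_tokens(tokens, W, A, B, invoke_at=2)   # adapter active from position 2

# Positions before the invocation point match the base model exactly,
# which is why their cached values would not need recomputing.
assert alora_keys[:2] == base_keys[:2]
assert alora_keys[2] != base_keys[2]
```

With a standard LoRA, the update would apply at every position, so none of the base model's cached projections could be reused.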

Authors:Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu
Title: InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
Abstract:
Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.
中文摘要:InstantCharacter是一个基于扩散变换器的可扩展框架,通过创新的适配器和大规模数据集训练,实现了跨领域的高保真、文本可控角色定制,解决了现有方法的泛化能力和图像质量问题。
English Summary: InstantCharacter is a scalable framework built on a diffusion transformer that overcomes limitations of existing methods by achieving high-fidelity, text-controllable character customization across diverse domains through a novel adapter and large-scale dataset training.

Authors:Sidun Liu, Wenyu Li, Peng Qiao, Yong Dou
Title: Regist3R: Incremental Registration with Stereo Foundation Model
Abstract:
Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves performance comparable to optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset which has long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.
中文: Regist3R是一种新颖的立体基础模型,能够从多视角图像实现高效可扩展的增量式三维重建,在保持与优化方法相当性能的同时显著提升计算效率,并在真实场景数据上超越现有模型。
English: Regist3R is a novel stereo foundation model that enables efficient and scalable incremental 3D reconstruction from multi-view images, achieving comparable performance to optimization-based methods while significantly improving computational efficiency and outperforming existing models on challenging real-world datasets.

Authors:Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
Title: Propaganda via AI? A Study on Semantic Backdoors in Large Language Models
Abstract:
Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors, i.e., covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at https://github.com/NayMyatMin/RAVEN.
中文摘要:大型语言模型易受基于概念触发器的语义后门攻击,为此研发的RAVEN检测框架通过跨模型一致性分析,成功在多种模型中发现了这类隐蔽漏洞。
English Summary: Large language models are susceptible to semantic backdoor attacks using conceptual triggers, prompting the development of RAVEN, a detection framework that successfully identifies these hidden vulnerabilities across multiple models.
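
The clustering-and-entropy step described in the abstract above can be sketched as follows. This is an illustrative sketch only, not RAVEN's code: `equivalent` is a hypothetical stand-in for a bidirectional-entailment check (which in practice would call an NLI model), and the greedy clustering and toy responses are assumptions.

```python
import math


def cluster_responses(responses, equivalent):
    """Greedily group responses; two responses share a cluster when
    `equivalent` holds in both directions (placeholder for entailment)."""
    clusters = []
    for r in responses:
        for c in clusters:
            if equivalent(c[0], r) and equivalent(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def semantic_entropy(responses, equivalent):
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    clusters = cluster_responses(responses, equivalent)
    n = len(responses)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_stance(a, b):
    """Toy equivalence: responses agree if their first word matches."""
    return a.split()[0] == b.split()[0]


uniform = ["pro: reason one", "pro: reason two", "pro: reason three"]
mixed = ["pro: reason one", "con: reason two", "pro: reason three", "con: reason four"]

print(semantic_entropy(uniform, same_stance))  # zero entropy: every sample in one cluster
print(semantic_entropy(mixed, same_stance))    # higher entropy: two equally likely clusters
```

Under this reading, an entropy near zero on a prompt where varied answers are expected is the "anomalously uniform" signal; the threshold and the cross-model comparison are left out of the sketch.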

Authors:Xiangju Li, Dong Yang, Xiaogang Zhu, Faliang Huang, Peng Zhang, Zhongying Zhao
Title: Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation
Abstract:
Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method's effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at https://github.com/zxgnlp/InstruDa-LLM.
中文: 本研究提出了一种基于指令调优大语言模型和数据增强的创新框架,显著提升了跨度级情感-原因-类别三元组提取性能,比现有方法至少提高了12.8%的指标。
English: This study introduces a novel framework using instruction-tuned large language models and data augmentation to significantly improve span-level emotion-cause-category triplet extraction, achieving at least a 12.8% improvement over existing baseline methods.

Authors:Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
Title: HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework comprises a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
Chinese: HM-RAG提出了一种分层多智能体框架,通过分解复杂查询并整合多源异构数据来增强多模态推理能力,相比传统RAG系统在多个基准测试中实现了显著准确率提升。
English: HM-RAG introduces a hierarchical multi-agent framework that enhances multimodal reasoning by decomposing complex queries and integrating diverse data sources, achieving significant accuracy improvements over conventional RAG systems.

Authors:Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou
Title: A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Abstract:
Reward models (RMs) have demonstrated impressive potential for enhancing Large Language Models (LLMs), as an RM can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub at https://github.com/JLZhong23/awesome-reward-models.
Chinese: 本文全面综述了奖励模型的研究进展、应用及挑战,旨在为初学者提供系统指导并推动该领域的未来发展。
English: This paper offers a comprehensive overview of reward models, detailing their development, applications, and challenges to serve as a foundational guide for beginners and future research.

Authors:Mengying Yuan, Wenhao Wang, Zixuan Wang, Yujie Huang, Kangli Wei, Fei Li, Chong Teng, Donghong Ji
Title: Cross-Document Cross-Lingual NLI via RST-Enhanced Graph Fusion and Interpretability Prediction
Abstract:
Natural Language Inference (NLI) is a fundamental task in natural language processing. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm: CDCL-NLI, which extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 25,410 instances and spanning 26 languages. To address the limitations of previous methods on the CDCL-NLI task, we further propose an innovative method that integrates RST-enhanced graph fusion with interpretability-aware prediction. Our approach leverages RST (Rhetorical Structure Theory) within heterogeneous graph neural networks for cross-document context modeling, and employs a structure-aware semantic alignment based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU (Elementary Discourse Unit)-level attribution framework that produces extractive explanations. Extensive experiments demonstrate our approach's superior performance, achieving significant improvements over both conventional NLI models and large language models. Our work sheds light on the study of NLI and should stimulate research interest in cross-document cross-lingual context understanding, hallucination elimination and interpretability inference. Our code and datasets are available at https://github.com/Leonardo123-ui/CDCL_NLI for peer review.
中文摘要:本文提出了跨文档跨语言自然语言推理(CDCL-NLI)的新范式,通过结合RST增强图融合与可解释性预测的方法,并构建多语言数据集,显著提升了跨文档跨语言语境下的推理性能。
English Summary: This paper introduces Cross-Document Cross-Lingual Natural Language Inference (CDCL-NLI), proposing a novel method combining RST-enhanced graph fusion with interpretability-aware prediction and creating a multilingual dataset to advance cross-document, cross-lingual understanding.

Authors:Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, Conghui He
Title: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
Abstract:
While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically achieved by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.
中文摘要:GRA框架通过模拟同行评审流程,使多个专业化的小型语言模型协作生成高质量数据,在保持高效可持续的同时,实现了与大型模型相当的性能表现。
English Summary: The GRA framework enables multiple specialized small language models to collaboratively generate high-quality data through a peer-review-inspired process, achieving performance comparable to large models while being more efficient and sustainable.

Authors:Pouya Samanipour, Hasan Poonawala
Title: SEROAISE: Advancing ROA Estimation for ReLU and PWA Dynamics through Estimating Certified Invariant Sets
Abstract:
This paper presents a novel framework for constructing the Region of Attraction (RoA) for dynamics derived either from Piecewise Affine (PWA) functions or from Neural Networks (NNs) with Rectified Linear Unit (ReLU) activation functions. This method, described as Sequential Estimation of RoA based on Invariant Set Estimation (SEROAISE), computes a Lyapunov-like PWA function over a certified PWA invariant set. While traditional approaches search for Lyapunov functions by enforcing Lyapunov conditions over pre-selected domains, this framework enforces Lyapunov-like conditions over a certified invariant subset obtained using the Iterative Invariant Set Estimator (IISE). Compared to the state-of-the-art, IISE provides systematically larger certified invariant sets. In order to find a larger invariant subset, the IISE utilizes a novel concept known as the Non-Uniform Growth of Invariant Set (NUGIS). A number of examples illustrating the efficacy of the proposed methods are provided, including dynamical systems derived from learning algorithms. The implementation is publicly available at: https://github.com/PouyaSamanipour/SEROAISE.git.
中文摘要:本文提出了一种名为SEROAISE的新框架,用于基于分段仿射函数或ReLU激活神经网络构建动力学系统的吸引域,通过计算经认证的不变集上的类李雅普诺夫函数,并利用非均匀增长概念,有效扩大了不变集范围,其有效性通过实例和公开代码得以验证。
English Summary: This paper introduces a novel framework called SEROAISE for estimating the Region of Attraction (RoA) in dynamical systems using Piecewise Affine functions or Neural Networks with ReLU activations, which computes Lyapunov-like functions over certified invariant sets and demonstrates improved performance through examples and public implementation.

Authors:Stefan Abi-Karam, Cong Hao
Title: HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks
Abstract:
The rapid scaling of large language model (LLM) training and inference has driven their adoption in semiconductor design across academia and industry. While most prior work evaluates LLMs on hardware description language (HDL) tasks, particularly Verilog, designers are increasingly using high-level synthesis (HLS) to build domain-specific accelerators and complex hardware systems. However, benchmarks and tooling to comprehensively evaluate LLMs for HLS design tasks remain scarce. To address this, we introduce HLS-Eval, the first complete benchmark and evaluation framework for LLM-driven HLS design. HLS-Eval targets two core tasks: (1) generating HLS code from natural language descriptions, and (2) performing HLS-specific code edits to optimize performance and hardware efficiency. The benchmark includes 94 unique designs drawn from standard HLS benchmarks and novel sources. Each case is prepared via a semi-automated flow that produces a natural language description and a paired testbench for C-simulation and synthesis validation, ensuring each task is "LLM-ready." Beyond the benchmark, HLS-Eval offers a modular Python framework for automated, parallel evaluation of both local and hosted LLMs. It includes a parallel evaluation engine, direct HLS tool integration, and abstractions to support different LLM interaction paradigms, enabling rapid prototyping of new benchmarks, tasks, and LLM methods. We demonstrate HLS-Eval through baseline evaluations of open-source LLMs on Vitis HLS, measuring outputs across four key metrics - parseability, compilability, runnability, and synthesizability - reflecting the iterative HLS design cycle. We also report pass@k metrics, establishing clear baselines and reusable infrastructure for the broader LLM-for-hardware community. All benchmarks, framework code, and results are open-sourced at https://github.com/stefanpie/hls-eval.
中文:HLS-Eval作为首个针对高层次综合设计任务的大语言模型评估框架,填补了自然语言生成与优化HLS代码工具的空缺,提供了基准测试和模块化评估工具。
English: HLS-Eval is introduced as the first comprehensive benchmark and evaluation framework for assessing large language models in high-level synthesis design tasks, addressing the scarcity of tools for generating and optimizing HLS code from natural language.
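
The abstract above reports pass@k without saying how it is computed. A minimal sketch, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c pass (this choice is an assumption, not confirmed by the abstract; the function name is illustrative):

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, c of which passed:
    the probability that at least one of k samples drawn without
    replacement is a passing one."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with made-up counts: 10 samples per task, 3 of them pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The same function could in principle be applied per stage (parseability, compilability, runnability, synthesizability) by changing what counts as a "pass".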

Authors:Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer
Title: FLIP Reasoning Challenge
Abstract:
Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly (75.2% vs. 69.6% for Gemini 1.5 Pro). Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
中文: FLIP数据集作为评估AI推理能力的基准,通过视觉序列任务揭示现有模型与人类表现差距显著,而图像描述和集成方法能有效提升其准确性。
English: The FLIP dataset is introduced as a benchmark to evaluate AI reasoning through visual sequence tasks, revealing that current models significantly lag behind human performance and benefit from captioning and ensemble methods.

Authors:Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam
Title: An Evaluation of N-Gram Selection Strategies for Regular Expression Indexing in Contemporary Text Analysis Tasks. Extended Version
Abstract:
Efficient evaluation of regular expressions (regex, for short) is crucial for text analysis, and n-gram indexes are fundamental to achieving fast regex evaluation performance. However, these indexes face scalability challenges because of the exponential number of possible n-grams that must be indexed. Many existing selection strategies, developed decades ago, have not been rigorously evaluated on contemporary large-scale workloads and lack comprehensive performance comparisons. Therefore, a unified and comprehensive evaluation framework is necessary to compare these methods under the same experimental settings. This paper presents the first systematic evaluation of three representative n-gram selection strategies across five workloads, including real-time production logs and genomic sequence analysis. We examine their trade-offs in terms of index construction time, storage overhead, false positive rates, and end-to-end query performance. Through empirical results, this study provides a modern perspective on existing n-gram based regular expression evaluation methods, extensive observations, valuable discoveries, and an adaptable testing framework to guide future research in this domain. We make our implementations of these methods and our test framework available as open-source at https://github.com/mush-zhang/RegexIndexComparison.
中文: 本文首次系统评估了三种用于正则表达式评估的n-gram选择策略,通过多工作负载分析其性能权衡,并提供了开源测试框架以指导未来研究。
English: This paper conducts the first systematic evaluation of three n-gram selection strategies for regular expression evaluation, analyzing their trade-offs across multiple workloads and providing an open-source framework to guide future research.
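
To make the role of an n-gram index in regex evaluation concrete, here is a minimal sketch. It indexes every character trigram rather than a selected subset, so it does not implement any of the selection strategies the paper compares; the function names, the toy documents, and the caller-supplied required literal are all assumptions for illustration.

```python
import re
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for g in ngrams(text, n):
            index[g].add(doc_id)
    return index


def search(docs, index, pattern, literal, n=3):
    """Prefilter with the n-grams of a literal that every match must
    contain, then verify the surviving candidates with the full regex."""
    candidates = set(range(len(docs)))
    for g in ngrams(literal, n):
        candidates &= index.get(g, set())
    regex = re.compile(pattern)
    matches = [i for i in sorted(candidates) if regex.search(docs[i])]
    return sorted(candidates), matches


docs = ["error code 42", "no errors here", "terror and horror", "all good"]
index = build_index(docs)
candidates, matches = search(docs, index, r"\berror\b", "error")
print(candidates)  # documents that survive the index filter
print(matches)     # documents that actually match the regex
```

Here the index skips one document outright, but two of the three candidates are false positives that only the final regex check rejects. Indexing fewer n-grams shrinks storage and build time at the cost of more such false positives, which is the kind of trade-off the abstract lists.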

Authors:Yancheng Zhang, Mengxin Zheng, Xun Chen, Jingtong Hu, Weidong Shi, Lei Ju, Yan Solihin, Qian Lou
Title: zkVC: Fast Zero-Knowledge Proof for Private and Verifiable Computing
Abstract:
In the context of cloud computing, services are held on cloud servers, where the clients send their data to the server and obtain the results returned by the server. However, the computation, data and results are prone to tampering due to the vulnerabilities on the server side. Thus, verifying the integrity of computation is important in the client-server setting. The cryptographic method known as Zero-Knowledge Proof (ZKP) is renowned for facilitating private and verifiable computing. ZKP allows the client to validate that the results from the server are computed correctly without violating the privacy of the server's intellectual property. Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (zkSNARKs), in particular, have been widely applied in various applications like blockchain and verifiable machine learning. Despite their popularity, existing zkSNARK approaches remain highly computationally intensive. For instance, even basic operations like matrix multiplication require an extensive number of constraints, resulting in significant overhead. In addressing this challenge, we introduce zkVC, which optimizes the ZKP computation for matrix multiplication, enabling rapid proof generation on the server side and efficient verification on the client side. zkVC integrates optimized ZKP modules, such as Constraint-reduced Polynomial Circuit (CRPC) and Prefix-Sum Query (PSQ), collectively yielding a more than 12-fold increase in proof speed over prior methods. The code is available at https://github.com/UCF-Lou-Lab-PET/zkformer.
中文摘要:在云计算中,零知识证明(ZKP)可让客户端验证服务器计算结果的完整性同时保护隐私,新提出的zkVC系统通过约束优化技术将矩阵运算的证明速度提升了12倍以上。
English Summary: In cloud computing, Zero-Knowledge Proofs (ZKP) enable clients to verify server computations while preserving privacy, and the proposed zkVC system uses constraint-reducing optimizations to speed up proof generation for matrix multiplication by more than 12-fold over prior methods.

Authors:Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, Vladlen Koltun
Title: CoMotion: Concurrent Multi-person 3D Motion
Abstract:
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion
中文: 本文提出一种单目三维多人姿态跟踪系统,通过从输入图像直接更新姿态实现拥挤场景中的时序一致性,在保持顶尖精度的同时实现了更快、更稳定的在线跟踪效果。
English: This paper presents a monocular 3D multi-person pose tracking system that achieves temporal coherence in crowded scenes through direct pose updates from input images, matching state-of-the-art accuracy while enabling faster and more robust online tracking.

Authors:Yike Liu, Haipeng Li, Shuaicheng Liu, Bing Zeng
Title: CodingHomo: Bootstrapping Deep Homography With Video Coding
Abstract:
Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo
中文: 本文提出CodingHomo无监督框架,通过利用视频编码中的运动向量并结合掩码引导融合与单应性估计模块,有效提升了单应性估计的精度,在鲁棒性和泛化性方面优于现有先进方法。
English: This paper introduces CodingHomo, an unsupervised framework that enhances homography estimation by leveraging motion vectors from video coding through its Mask-Guided Fusion and Homography Estimation modules, achieving superior robustness and outperforming existing methods.

Authors:Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu
Title: FocusedAD: Character-centric Movie Audio Description
Abstract:
Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at https://github.com/Thorin215/FocusedAD.
中文摘要:FocusedAD是一种新颖的以角色为中心的框架,通过追踪角色并融合上下文线索来生成与情节相关的电影音频描述,在多个基准测试中实现了最先进的性能。
English Summary: FocusedAD is a novel character-centric framework that generates plot-relevant movie audio descriptions by tracking characters and incorporating contextual cues, achieving state-of-the-art performance across multiple benchmarks.

Authors:Miaosen Luo, Yuncheng Jiang, Sijie Mai
Title: Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis
Abstract:
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture. Our code is released on https://github.com/LuoMSen/KAN-MCP.
中文: KAN-MCP框架通过结合可解释的柯尔莫哥洛夫-阿诺德网络实现跨模态交互透明分析,并采用具有降噪降维功能的多模态清洁帕累托框架增强鲁棒性,有效解决了多模态情感分析中的可解释性与模态不平衡问题。
English: The KAN-MCP framework addresses interpretability and modality imbalance in Multimodal Sentiment Analysis by integrating Kolmogorov-Arnold Networks for transparent cross-modal interaction analysis and the Multimodal Clean Pareto framework with denoising and dimensionality reduction to enhance robustness.

Authors:Shuo Li, Fang Liu, Zehua Hao, Xinyi Wang, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma
Title: Logits DeConfusion with CLIP for Few-Shot Learning
Abstract:
With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. The code is available at https://github.com/LiShuo1001/LDC.
中文: 提出的Logits DeConfusion方法通过多级适配器融合和类间去混淆模块,有效解决了CLIP在分类任务中的类间混淆问题,显著提升了分类性能。
English: The proposed Logits DeConfusion method, integrating Multi-level Adapter Fusion and Inter-Class Deconfusion modules, effectively mitigates CLIP's inter-class confusion in logits, significantly boosting classification accuracy.

Authors:Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
Title: DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
Abstract:
Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method based on prompt-tuning to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insights are to enhance the features of the SAM's prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on fused features and initial visual prompts. Next, a dual-branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
中文: 本研究提出的DC-SAM方法通过增强视觉提示和双重一致性机制,将Segment Anything模型适配于上下文分割任务,在图像和新建立的视频基准测试中均取得了最优性能。
English: The proposed DC-SAM method adapts Segment Anything Models for in-context segmentation by enhancing visual prompts and implementing dual consistency mechanisms, achieving state-of-the-art performance on both image and newly established video benchmarks.

Authors:Yizhuo Wu, Francesco Fioranelli, Chang Gao
Title: RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model
Abstract:
Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as Vision Transformer (ViT) and State-Space Model (SSM) architectures, offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: https://github.com/lab-emi/AIRHAR.
中文: RadMamba提出了一种参数高效的Mamba状态空间模型,用于雷达人体活动识别,在多个数据集上以极低的计算复杂度实现了顶尖的分类精度。
English: RadMamba introduces a parameter-efficient Mamba State-Space Model for radar-based human activity recognition, achieving top-tier accuracy with drastically reduced computational complexity across multiple datasets.

Authors:Mohamad Dalal, Artur Xarles, Anthony Cioppa, Silvio Giancola, Marc Van Droogenbroeck, Bernard Ghanem, Albert Clapés, Sergio Escalera, Thomas B. Moeslund
Title: Action Anticipation from SoccerNet Football Video Broadcasts
Abstract:
Artificial intelligence has revolutionized the way we analyze sports videos, whether to understand the actions of games in long untrimmed videos or to anticipate the player's motion in future frames. Despite these efforts, little attention has been given to anticipating game actions before they occur. In this work, we introduce the task of action anticipation for football broadcast videos, which consists in predicting future actions in unobserved future frames, within a five- or ten-second anticipation window. To benchmark this task, we release a new dataset, namely the SoccerNet Ball Action Anticipation dataset, based on SoccerNet Ball Action Spotting. Additionally, we propose a Football Action ANticipation TRAnsformer (FAANTRA), a baseline method that adapts FUTR, a state-of-the-art action anticipation model, to predict ball-related actions. To evaluate action anticipation, we introduce new metrics, including mAP@$δ$, which evaluates the temporal precision of predicted future actions, as well as mAP@$\infty$, which evaluates their occurrence within the anticipation window. We also conduct extensive ablation studies to examine the impact of various task settings, input configurations, and model architectures. Experimental results highlight both the feasibility and challenges of action anticipation in football videos, providing valuable insights into the design of predictive models for sports analytics. By forecasting actions before they unfold, our work will enable applications in automated broadcasting, tactical analysis, and player decision-making. Our dataset and code are publicly available at https://github.com/MohamadDalal/FAANTRA.
中文: 本文提出了足球视频中动作预测的新任务,通过FAANTRA模型和新建数据集在5-10秒窗口内实现动作预判,填补了体育视频分析中前瞻性研究的空白。
English: This paper introduces a novel task of anticipating football actions in broadcast videos, proposing the FAANTRA model and a new dataset to benchmark predictive capabilities within 5-10 second windows while addressing current gaps in sports video analysis.

Authors:Heesoo Jung, Hogun Park
Title: Balancing Graph Embedding Smoothness in Self-Supervised Learning via Information-Theoretic Decomposition
Abstract:
Self-supervised learning (SSL) in graphs has garnered significant attention, particularly in employing Graph Neural Networks (GNNs) with pretext tasks initially designed for other domains, such as contrastive learning and feature reconstruction. However, it remains uncertain whether these methods effectively reflect essential graph properties, specifically the similarity of a node's representation to those of its neighbors. We observe that existing methods sit at opposite ends of a spectrum driven by the graph embedding smoothness, with each end performing better on specific downstream tasks. Decomposing the SSL objective into three terms via an information-theoretic framework with a neighbor representation variable reveals that this polarization stems from an imbalance among the terms, which existing methods may not effectively maintain. Further insights suggest that balancing between the extremes can lead to improved performance across a wider range of downstream tasks. A framework, BSG (Balancing Smoothness in Graph SSL), introduces novel loss functions designed to supplement the representation quality in graph-based SSL by balancing the derived three terms: neighbor loss, minimal loss, and divergence loss. We present a theoretical analysis of the effects of these loss functions, highlighting their significance from both the SSL and graph smoothness perspectives. Extensive experiments on multiple real-world datasets across node classification and link prediction consistently demonstrate that BSG achieves state-of-the-art performance, outperforming existing methods. Our implementation code is available at https://github.com/steve30572/BSG.
中文摘要:BSG框架通过平衡三个损失函数解决了图自监督学习中的不平衡问题,在多种下游任务中实现了最先进的性能。
English Summary: The BSG framework addresses the imbalance in self-supervised graph learning by introducing three balanced loss functions, achieving state-of-the-art performance across various downstream tasks.

Authors:Pascal Schlachter, Jonathan Fuss, Bin Yang
Title: Analysis of Pseudo-Labeling for Online Source-Free Universal Domain Adaptation
Abstract:
A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes has not been studied yet. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy (CE) loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is in general more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field. Our code is available at https://github.com/pascalschlachter/PLAnalysis.
中文: 本研究系统分析了在线无源通用域自适应中的伪标签技术,揭示了当前方法与理想性能间的显著差距,并证明对比损失在中等精度伪标签下表现稳健,而交叉熵损失在接近完美伪标签时更优,强调伪标签质量比数量更为关键。
English: This study systematically analyzes pseudo-labeling in online source-free universal domain adaptation, revealing a significant performance gap from ideal conditions and demonstrating that contrastive loss adapts well with moderate pseudo-label accuracy while cross-entropy loss excels with near-perfect labels, emphasizing the importance of prioritizing pseudo-label quality over quantity.

Authors:Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa
Title: LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Abstract:
Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.22 (EM) and 0.40 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
中文: 本研究证明,使用大型语言模型作为阅读理解问答模型的评估工具,能显著提升与人类判断的相关性,优于传统的精确匹配和F1分数指标。
English: This study demonstrates that using large language models as judges for evaluating reading comprehension QA models significantly improves correlation with human judgments, outperforming traditional metrics like EM and F1-score.
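For reference, the two traditional metrics discussed in this abstract can be sketched as follows. This is a minimal illustration in the style of common SQuAD-like evaluation scripts, not the authors' evaluation code; the function names and the example answers are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical job-type answer: the prediction contains the reference span
# plus extra detail, so EM scores it as wrong and F1 gives partial credit.
prediction = "a software engineer at Google"
reference = "software engineer"
em_score = exact_match(prediction, reference)
f1_score = token_f1(prediction, reference)
```

A prediction that a human reader might accept still receives an EM of 0 here, which illustrates the kind of underestimation the abstract attributes to string-overlap metrics.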

Authors:Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen
Title: Exploring Video-Based Driver Activity Recognition under Noisy Labels
Abstract:
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
Chinese: 本文提出了首个用于驾驶员活动识别的标签噪声学习方法,通过基于聚类的表示学习和无需超参数的样本选择策略,在Drive&Act数据集上实现了优于其他方法的性能。
English: This paper introduces the first label noise learning method for driver activity recognition, utilizing clustering-based representation learning and a hyperparameter-free sample selection strategy to achieve superior performance on the Drive&Act dataset.

Authors:Xia Deng, Shen Chen, Jiale Zhou, Lei Li
Title: Mind2Matter: Creating 3D Models from EEG Signals
Abstract:
The reconstruction of 3D objects from brain signals has gained significant attention in brain-computer interface (BCI) research. Current research predominantly utilizes functional magnetic resonance imaging (fMRI) for 3D reconstruction tasks due to its excellent spatial resolution. Nevertheless, the clinical utility of fMRI is limited by its prohibitive costs and inability to support real-time operations. In comparison, electroencephalography (EEG) presents distinct advantages as an affordable, non-invasive, and mobile solution for real-time brain-computer interaction systems. While recent advances in deep learning have enabled remarkable progress in image generation from neural data, decoding EEG signals into structured 3D representations remains largely unexplored. In this paper, we propose a novel framework that translates EEG recordings into 3D object reconstructions by leveraging neural decoding techniques and generative models. Our approach involves training an EEG encoder to extract spatiotemporal visual features, fine-tuning a large language model to interpret these features into descriptive multimodal outputs, and leveraging generative 3D Gaussians with layout-guided control to synthesize the final 3D structures. Experiments demonstrate that our model captures salient geometric and semantic features, paving the way for applications in brain-computer interfaces (BCIs), virtual reality, and neuroprosthetics. Our code is available at https://github.com/sddwwww/Mind2Matter.
中文: 本研究提出了一种创新框架,通过神经解码和生成模型将脑电图信号转化为三维物体重建,为脑机接口和虚拟现实应用开辟了新途径。
English: This study introduces a novel framework that translates EEG signals into 3D object reconstructions using neural decoding and generative models, demonstrating potential for brain-computer interfaces and virtual reality applications.

Authors:Lvpan Cai, Haowei Wang, Jiayi Ji, YanShu ZhouMen, Yiwei Ma, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji
Title: Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
Abstract:
The rise of AI-generated image editing tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce \textbf{BR-Gen}, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated Perception-Creation-Evaluation pipeline to ensure semantic coherence and visual realism. In addition, we propose \textbf{NFA-ViT}, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, \emph{i.e.}, potential edited areas, by noise fingerprints. Subsequently, an attention mechanism is introduced to compel the interaction between normal and abnormal features, thereby propagating the generalization traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Furthermore, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks. All data and code are available at https://github.com/clpbc/BR-Gen.
中文摘要:本文提出了BR-Gen这一针对现有方法忽视的场景级图像伪造的大规模数据集,并开发了NFA-ViT噪声引导视觉Transformer,通过特征增强和跨区域注意力机制有效提升局部伪造检测能力。
English Summary: This paper introduces BR-Gen, a large-scale dataset addressing scene-level image forgeries overlooked by existing methods, and proposes NFA-ViT, a noise-guided vision transformer that enhances localized forgery detection through feature amplification and cross-region attention mechanisms.

Authors:Qishan Wang, Jia Guo, Shuyong Gao, Haofen Wang, Li Xiong, Junjie Hu, Hanqi Guo, Wenqiang Zhang
Title: Search is All You Need for Few-shot Anomaly Detection
Abstract:
Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies - support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at https://github.com/Qiqigeww/VisionAD.
Chinese: VisionAD提出了一种简单高效的近邻搜索框架,用于少样本异常检测,无需训练或复杂提示工程,即在多个基准测试中显著超越现有最优方法。
English: VisionAD introduces a simple yet effective nearest-neighbor framework for few-shot anomaly detection, outperforming state-of-the-art methods across multiple benchmarks without requiring training or complex prompt engineering.
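The core idea of nearest-neighbor anomaly scoring against a memory bank of normal features can be sketched as below. This is a generic illustration and not VisionAD's implementation: the cosine distance, the max aggregation to an image-level score, and all names and toy vectors are assumptions made for the example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def patch_scores(query_patches, memory_bank):
    """Score each query patch by its distance to the nearest normal feature."""
    return [
        min(cosine_distance(q, m) for m in memory_bank)
        for q in query_patches
    ]


def image_score(query_patches, memory_bank):
    """Image-level score: the most anomalous patch (assumed aggregation)."""
    return max(patch_scores(query_patches, memory_bank))


# Toy 2-D "features": the memory bank holds features from normal images.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]
normal_query = [[2.0, 0.0], [0.0, 3.0]]     # same directions as the bank
anomalous_query = [[2.0, 0.0], [-1.0, 0.0]]  # second patch points away
normal_result = image_score(normal_query, memory_bank)
anomalous_result = image_score(anomalous_query, memory_bank)
```

No parameters are trained in this sketch: the memory bank is simply a list of stored features, which is what makes this family of methods training-free.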

Authors:Yushuai Sun, Zikun Zhou, Dongmei Jiang, Yaowei Wang, Jun Yu, Guangming Lu, Wenjie Pei
Title: Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
Abstract:
Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. Our code and model are available at https://github.com/Bunny-Black/PrunNet.
Chinese: 本文提出了一种具有自兼容性的可修剪网络,通过训练后修剪能够按需生成任意容量的兼容子网络,无需额外训练即可适配新平台部署。
English: This paper introduces a Prunable Network with self-compatibility that enables the generation of compatible subnetworks at any capacity through post-training pruning, eliminating the need for additional training when deploying to new platforms.

Authors:Thu Hang Khuat, Duy-Nam Bui, Hoa TT. Nguyen, Mien L. Trinh, Minh T. Nguyen, Manh Duong Phung
Title: Multi-goal Rapidly Exploring Random Tree with Safety and Dynamic Constraints for UAV Cooperative Path Planning
Abstract:
Cooperative path planning is gaining importance due to the increasing demand for using multiple unmanned aerial vehicles (UAVs) in complex missions. This work addresses the problem by introducing a new algorithm named MultiRRT that extends the rapidly exploring random tree (RRT) to generate paths for a group of UAVs to reach multiple goal locations at the same time. We first derive the dynamics constraint of the UAV and include it in the problem formulation. MultiRRT is then developed, taking into account the cooperative requirements and safety constraints during its path-searching process. The algorithm features two new mechanisms, node reduction and Bezier interpolation, to ensure the feasibility and optimality of the paths generated. Importantly, the interpolated paths are proven to meet the safety and dynamics constraints imposed by obstacles and the UAVs. A number of simulations, comparisons, and experiments have been conducted to evaluate the performance of the proposed approach. The results show that MultiRRT can generate collision-free paths for multiple UAVs to reach their goals with better scores in path length and smoothness metrics than state-of-the-art RRT variants including Theta-RRT, FN-RRT, RRT*, and RRT*-Smart. The generated paths are also tested in practical flights with real UAVs to evaluate their validity for cooperative tasks. The source code of the algorithm is available at https://github.com/duynamrcv/multi-target_RRT
中文摘要:本文提出MultiRRT算法,通过扩展快速探索随机树(RRT)方法,结合节点缩减和贝塞尔插值机制,为多架无人机实现同步抵达不同目标点的协同路径规划,在仿真和实际飞行测试中均展现出优于主流RRT变体算法的路径性能。
English Summary: This paper introduces MultiRRT, an enhanced algorithm that extends RRT to enable multiple UAVs to simultaneously reach different destinations while meeting cooperative requirements and safety constraints through novel node reduction and Bezier interpolation techniques, demonstrating superior performance in simulations and real-world tests compared to existing methods.
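To illustrate how Bezier interpolation can smooth a sharp corner between waypoints, here is a minimal sketch using a quadratic Bezier curve. The abstract does not specify the curve order or how control points are chosen, so the quadratic form, the function names, and the waypoints below are all assumptions for illustration, not the MultiRRT implementation.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Evaluate B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2 for t in [0, 1]."""
    u = 1.0 - t
    return tuple(
        u * u * a + 2.0 * u * t * b + t * t * c
        for a, b, c in zip(p0, p1, p2)
    )


def sample_curve(p0, p1, p2, n=10):
    """Return n + 1 evenly spaced (in t) points along the curve."""
    return [quadratic_bezier(p0, p1, p2, i / n) for i in range(n + 1)]


# Hypothetical waypoints forming a right-angle turn at p1.
p0, p1, p2 = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
smoothed = sample_curve(p0, p1, p2, n=4)
midpoint = quadratic_bezier(p0, p1, p2, 0.5)
```

The curve starts at p0 and ends at p2 but only bends toward the corner p1 without passing through it, which replaces the abrupt heading change with a gradual one.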

Authors:Kishan Gurumurthy, Himanshu Pal, Charu Sharma
Title: Federated Spectral Graph Transformers Meet Neural Ordinary Differential Equations for Non-IID Graphs
Abstract:
Graph Neural Network (GNN) research is rapidly advancing due to GNNs' capacity to learn distributed representations from graph-structured data. However, centralizing large volumes of real-world graph data for GNN training is often impractical due to privacy concerns, regulatory restrictions, and commercial competition. Federated learning (FL), a distributed learning paradigm, offers a solution by preserving data privacy with collaborative model training. Despite progress in training huge vision and language models, federated learning for GNNs remains underexplored. To address this challenge, we present a novel method for federated learning on GNNs based on spectral GNNs equipped with neural ordinary differential equations (ODE) for better information capture, showing promising results across both homophilic and heterophilic graphs. Our approach effectively handles non-Independent and Identically Distributed (non-IID) data, while also achieving performance comparable to existing methods that only operate on IID data. It is designed to be privacy-preserving and bandwidth-optimized, making it suitable for real-world applications such as social network analysis, recommendation systems, and fraud detection, which often involve complex, non-IID, and heterophilic graph structures. Our results in the area of federated learning on non-IID heterophilic graphs demonstrate significant improvements, while also achieving better performance on homophilic graphs. This work highlights the potential of federated learning in diverse and challenging graph settings. Open-source code available on GitHub (https://github.com/SpringWiz11/Fed-GNODEFormer).
中文: 本文提出了一种基于谱图神经网络和神经常微分方程的新型联邦学习方法,能够有效处理非独立同分布和异配性图数据,在保持隐私和带宽优化的同时,在各类图结构上均展现出优越性能。
English: This paper introduces a novel federated learning method for Graph Neural Networks using spectral GNNs with neural ODEs, which effectively handles non-IID and heterophilic graphs while maintaining privacy and bandwidth efficiency, demonstrating superior performance across various graph types.

Authors:Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
Title: GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision
Abstract:
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
Chinese: 本文提出GrabS,一种两阶段无监督方法,通过从数据集中学习以对象为中心的先验知识,并利用具身智能体查询发现对象,在无需人工标注的情况下实现了超越现有方法的3D分割性能。
English: This paper introduces GrabS, a two-stage unsupervised method that learns object-centric priors from datasets and uses an embodied agent to discover objects, achieving superior 3D segmentation performance over existing approaches without human labels.

Authors:Zongye Zhang, Wenrui Cai, Qingjie Liu, Yunhong Wang
Title: SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation
Abstract:
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degeneration. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs. The code and data are available at: https://github.com/zzysteve/SkeletonX
中文摘要:本文提出SkeletonX轻量级训练框架,通过构建样本对和特征聚合模块,利用表演者差异性和动作共性,有效提升基于GCN的骨架动作识别模型在有限标注数据下的性能表现。
English summary: This paper introduces SkeletonX, a lightweight training pipeline that enhances skeleton action recognition with limited labeled data by leveraging performer variability and action commonality through sample pair construction and feature aggregation.

Authors:Muhammad Shahid Muneer, Simon S. Woo
Title: Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset
Abstract:
In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily outplay defensive measures. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. First, this work proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: https://github.com/shahidmuneer/multimodal-nsfw-defense.
中文: 本研究通过构建百万规模的多模态数据集和开发鲁棒防御系统,有效提升了文本到图像模型对不良内容的检测能力,显著降低了多模态对抗攻击的成功率。
English: This research addresses the vulnerability of Text-to-Image models to adversarial attacks by introducing a million-scale multimodal dataset and a robust defense system that significantly enhances NSFW content detection and reduces attack success rates.

Authors:Xingwu Ji, Haochen Niu, Dexin Duan, Rendong Ying, Fei Wen, Peilin Liu
Title: An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World
Abstract:
Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. Toward addressing this problem, this work aims to develop a visual odometry system that can quickly adapt to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban, in-house datasets and a robot platform. Code is publicly available at: https://github.com/jixingwu/SOL-SLAM.
中文: 本文提出了一种自监督在线自适应单目视觉里程计框架,通过轻量级深度优化和场景语义整合,实现了对多样化新环境的高效泛化能力。
English: This paper introduces a self-supervised online adaptation framework for monocular visual odometry that enables efficient generalization to diverse environments through lightweight depth refinement and contextual semantic integration.

Authors:Amirhossein Dadashzadeh, Parsa Esmati, Majid Mirmehdi
Title: Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Abstract:
Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: https://github.com/Plrbear/Co-Star
中文:Co-STAR框架通过将课程学习与教师模型和CLIP的协作自训练相结合,采用基于可靠性的权重函数和自适应课程正则化,有效解决了伪标签噪声和过度自信预测问题,在多个视频领域自适应基准测试中表现优异。
English: The Co-STAR framework enhances source-free unsupervised video domain adaptation by combining curriculum learning with collaborative self-training between a teacher model and CLIP, using reliability-based weighting and adaptive regularization to address noisy pseudo-labels and over-confident predictions, achieving superior performance across benchmarks.
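The abstract above does not give the form of the reliability-based weight, so the following is only a minimal sketch of one way a "bidirectional prediction alignment" score could be computed. It assumes a symmetric KL divergence between the teacher and CLIP class distributions, mapped to a weight in (0, 1]. The function names and the exponential mapping are hypothetical illustrations, not the Co-STAR implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reliability_weight(teacher_logits: Sequence[float], clip_logits: Sequence[float]) -> float:
    """Hypothetical weight: 1.0 when both models agree, approaching 0 as they diverge.

    Assumption: alignment is measured in both directions (teacher -> CLIP and
    CLIP -> teacher) via a symmetric KL divergence; the source does not state this.
    """
    p = softmax(teacher_logits)
    q = softmax(clip_logits)
    symmetric_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return math.exp(-max(symmetric_kl, 0.0))


# Agreeing predictions get a high weight, conflicting ones a low weight.
agree = reliability_weight([4.0, 0.0, 0.0], [4.0, 0.0, 0.0])
conflict = reliability_weight([4.0, 0.0, 0.0], [0.0, 4.0, 0.0])
print(f"agree={agree:.3f} conflict={conflict:.3f}")
```

Under this assumed form, pseudo-labels on which the two models closely align would dominate the loss, while samples with conflicting predictions keep a small, non-zero contribution.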

Authors:Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
Title: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Abstract:
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3--46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7--14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code is available at https://github.com/LeanModels/DFloat11.
中文: 本文提出DFloat11无损压缩框架,通过动态长度编码和定制GPU解压内核,将大型AI模型体积减少30%同时保持输出结果完全一致。
English: This paper introduces DFloat11, a lossless compression framework that reduces large AI model sizes by 30% while maintaining bit-for-bit identical outputs through dynamic-length encoding and custom GPU decompression kernels.
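The low-entropy observation behind DFloat11 can be illustrated with a short, self-contained sketch. It makes several assumptions the abstract does not state: that the 8-bit exponent field of BFloat16 is the part being entropy-coded, that Huffman coding is the entropy coder, and that synthetic Gaussian values are a reasonable stand-in for real model weights. The helper names are hypothetical, and the sketch covers only the size estimate, not the GPU decompression kernel.

```python
import heapq
import math
import random
import struct
from collections import Counter
from typing import Dict


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to BFloat16."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16        # BFloat16 keeps the top 16 bits of float32 (truncation, no rounding)
    return (bits16 >> 7) & 0xFF  # layout: 1 sign bit, 8 exponent bits, 7 mantissa bits


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits per symbol of the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts: Counter) -> Dict[int, int]:
    """Code length (in bits) of each symbol in a Huffman code built from counts."""
    if len(counts) == 1:
        return {sym: 1 for sym in counts}
    # Heap entries: (subtree weight, tie breaker, symbols contained in the subtree).
    heap = [(c, i, [sym]) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in counts}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:      # every merge adds one bit to all symbols below it
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]  # synthetic stand-in for model weights
counts = Counter(bf16_exponent(w) for w in weights)
entropy = shannon_entropy(counts)
lengths = huffman_code_lengths(counts)
avg_len = sum(counts[s] * lengths[s] for s in counts) / len(weights)
bits_per_weight = 1 + 7 + avg_len  # sign and mantissa stored raw, exponent Huffman-coded
print(f"exponent entropy: {entropy:.2f} bits, Huffman average: {avg_len:.2f} bits")
print(f"estimated size: {bits_per_weight:.2f} of 16 bits per weight")
```

Because the exponent values cluster in a narrow range, the coded exponent needs far fewer than 8 bits on average. The ratio printed here depends on the synthetic distribution and should not be read as a reproduction of the paper's reported 30% figure.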

Authors:Dong Wang, Hannes Haag, Daniel Casado Herraez, Stefan May, Cyrill Stachniss, Andreas Nüchter
Title: Doppler-SLAM: Doppler-Aided Radar-Inertial and LiDAR-Inertial Simultaneous Localization and Mapping
Abstract:
Simultaneous localization and mapping (SLAM) is a critical capability for autonomous systems. Traditional SLAM approaches, which often rely on visual or LiDAR sensors, face significant challenges in adverse conditions such as low light or featureless environments. To overcome these limitations, we propose a novel Doppler-aided radar-inertial and LiDAR-inertial SLAM framework that leverages the complementary strengths of 4D radar, FMCW LiDAR, and inertial measurement units. Our system integrates Doppler velocity measurements and spatial data into a tightly-coupled front-end and graph optimization back-end to provide enhanced ego velocity estimation, accurate odometry, and robust mapping. We also introduce a Doppler-based scan-matching technique to improve front-end odometry in dynamic environments. In addition, our framework incorporates an innovative online extrinsic calibration mechanism, utilizing Doppler velocity and loop closure to dynamically maintain sensor alignment. Extensive evaluations on both public and proprietary datasets show that our system significantly outperforms state-of-the-art radar-SLAM and LiDAR-SLAM frameworks in terms of accuracy and robustness. To encourage further research, the code of our Doppler-SLAM and our dataset are available at: https://github.com/Wayne-DWA/Doppler-SLAM.
Chinese: 本文提出了一种新型的多普勒辅助雷达-惯性与激光雷达-惯性SLAM框架,通过融合多普勒速度测量和空间数据,提升了自身速度估计、里程计精度和建图鲁棒性,在多种环境下均优于现有方法。
English: This paper introduces a novel Doppler-aided radar-inertial and LiDAR-inertial SLAM framework that integrates Doppler velocity measurements with spatial data to enhance ego velocity estimation, odometry accuracy, and mapping robustness, outperforming existing methods in various conditions.

Authors:Tianjian Yang, Wei Vivian Li
Title: Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations
Abstract:
Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.
Chinese: 提出的广义概率典型相关分析(GPCCA)是一种无监督方法,能有效整合多模态数据、处理缺失值,并提供稳健的低维嵌入,从而在多种应用中提升聚类和分析效果。
English: The proposed Generalized Probabilistic Canonical Correlation Analysis (GPCCA) is an unsupervised method that effectively integrates multi-modal data, handles missing values, and provides robust low-dimensional embeddings for improved clustering and analysis across various applications.

Authors:Ziyu Cao, William Talbot, Kailai Li
Title: RESPLE: Recursive Spline Estimation for LiDAR-Based Odometry
Abstract:
We present a novel recursive Bayesian estimation framework using B-splines for continuous-time 6-DoF dynamic motion estimation. The state vector consists of a recurrent set of position control points and orientation control point increments, enabling efficient estimation via a modified iterated extended Kalman filter without involving error-state formulations. The resulting recursive spline estimator (RESPLE) is further leveraged to develop a versatile suite of direct LiDAR-based odometry solutions, supporting the integration of one or multiple LiDARs and an IMU. We conduct extensive real-world evaluations using public datasets and our own experiments, covering diverse sensor setups, platforms, and environments. Compared to existing systems, RESPLE achieves comparable or superior estimation accuracy and robustness, while attaining real-time efficiency. Our results and analysis demonstrate RESPLE's strength in handling highly dynamic motions and complex scenes within a lightweight and flexible design, showing strong potential as a universal framework for multi-sensor motion estimation. We release the source code and experimental datasets at https://github.com/ASIG-X/RESPLE.
中文摘要:本文提出了一种基于B样条的递归样条估计器RESPLE,用于六自由度连续时间动态运动估计,该框架在多种传感器配置下实现了实时高效处理,并在动态运动和复杂场景中表现出卓越的估计精度与鲁棒性。
English Summary: This paper introduces RESPLE, a novel recursive spline estimator using B-splines for continuous-time 6-DoF motion estimation, which achieves real-time efficiency and superior accuracy in handling dynamic motions across diverse sensor setups.
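To make the idea of a continuous-time trajectory built from position control points concrete, here is a minimal sketch of evaluating one segment of a uniform cubic B-spline. The spline order and uniform knot spacing are assumptions (the abstract only says "B-splines"), the function names are hypothetical, and the orientation part and the Kalman-filter update are not covered.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def cubic_bspline_basis(u: float) -> Tuple[float, float, float, float]:
    """Uniform cubic B-spline blending weights for normalized time u in [0, 1]."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    b0 = (1.0 - u) ** 3 / 6.0
    b1 = (3.0 * u**3 - 6.0 * u**2 + 4.0) / 6.0
    b2 = (-3.0 * u**3 + 3.0 * u**2 + 3.0 * u + 1.0) / 6.0
    b3 = u**3 / 6.0
    return b0, b1, b2, b3


def spline_position(ctrl: Sequence[Point], u: float) -> Point:
    """Blend four consecutive position control points at normalized time u."""
    if len(ctrl) != 4:
        raise ValueError("a cubic segment needs exactly four control points")
    w = cubic_bspline_basis(u)
    return tuple(sum(w[i] * ctrl[i][k] for i in range(4)) for k in range(3))


# Four control points along the x-axis; the segment can be queried at any time in [0, 1].
control_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
for u in (0.0, 0.5, 1.0):
    print(u, spline_position(control_points, u))
```

Because only four control points influence any given time instant, an estimator can keep a small, sliding ("recurrent") set of control points in its state, which is consistent with the state vector described in the abstract.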

Authors:Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun
Title: NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
Abstract:
Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.
中文: NodeRAG提出了一种以图为中心的框架,采用异构图结构将基于图的方法无缝整合到检索增强生成中,显著提高了效率,并在索引、查询速度、存储和问答性能上优于现有方法。
English: NodeRAG introduces a graph-centric framework using heterogeneous graph structures to seamlessly integrate graph-based methodologies into retrieval-augmented generation, enhancing efficiency and outperforming previous methods in indexing, query speed, storage, and question-answering performance.

Authors:Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
Title: REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Abstract:
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
中文摘要:REAL是一个基于真实网站确定性模拟的多轮智能体评估基准与框架,包含112项实际任务,实证显示前沿语言模型成功率最高仅达41%,揭示了自主网络导航能力的重大不足。
English Summary: REAL is a benchmark and framework for evaluating multi-turn agents using deterministic simulations of real-world websites, featuring 112 practical tasks that reveal frontier language models achieve only up to 41% success rate, highlighting significant gaps in autonomous web navigation capabilities.

Authors:Mansoor Hayat, Supavadee Aramvith, Subrata Bhattacharjee, Nouman Ahmad
Title: Attention GhostUNet++: Enhanced Segmentation of Adipose Tissue and Liver in CT Images
Abstract:
Accurate segmentation of abdominal adipose tissue, including subcutaneous (SAT) and visceral adipose tissue (VAT), along with liver segmentation, is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. This study proposes Attention GhostUNet++, a novel deep learning model incorporating Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. Evaluated on the AATTCT-IDS and LiTS datasets, the model achieved Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary detail segmentation, the proposed model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis. The implementation of the proposed Attention GhostUNet++ model is available at: https://github.com/MansoorHayat777/Attention-GhostUNetPlusPlus.
中文摘要:本研究提出Attention GhostUNet++深度学习模型,通过集成多种注意力机制实现腹部脂肪组织和肝脏的精准分割,在保持高计算效率的同时显著超越了基准模型的性能表现。
English Summary: The study introduces Attention GhostUNet++, a deep learning model integrating multiple attention mechanisms for precise abdominal adipose and liver tissue segmentation, achieving superior performance over baseline models with enhanced computational efficiency.

Authors:Huaxiang Zhang, Hao Zhang, Aoran Mei, Zhongxue Gan, Guo-Niu Zhu
Title: SO-DETR: Leveraging Dual-Domain Features and Knowledge Distillation for Small Object Detection
Abstract:
Detection Transformer-based methods have achieved significant advancements in general object detection. However, challenges remain in effectively detecting small objects. One key difficulty is that existing encoders struggle to efficiently fuse low-level features. Additionally, the query selection strategies are not effectively tailored for small objects. To address these challenges, this paper proposes an efficient model, Small Object Detection Transformer (SO-DETR). The model comprises three key components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates spatial and frequency domains to fuse multi-scale features effectively. This approach enhances the representation of high-resolution features while maintaining relatively low computational overhead. The enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes using expanded IoU, thereby improving the allocation of query resources. Furthermore, by incorporating a lightweight backbone network and implementing a knowledge distillation strategy, we develop an efficient detector for small objects. Experimental results on the VisDrone-2019-DET and UAVVaste datasets demonstrate that SO-DETR outperforms existing methods with similar computational demands. The project page is available at https://github.com/ValiantDiligent/SO_DETR.
中文摘要:本文提出SO-DETR模型,通过双域混合编码器、优化查询选择机制和知识蒸馏策略,有效提升小物体检测性能,在基准数据集上表现优于现有方法。
English Summary: The paper introduces SO-DETR, a transformer-based model that enhances small object detection through a dual-domain hybrid encoder, improved query selection, and knowledge distillation, achieving superior performance on benchmark datasets.

Authors:Ziqi Pang, Xin Xu, Yu-Xiong Wang
Title: Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Abstract:
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.
中文: 本研究通过分析生成扩散过程与感知任务的匹配差距,针对不同去噪步骤提出定制化学习目标和数据增强方法,显著提升了扩散模型在判别任务中的感知精度和交互性,无需改动架构即可实现最优性能。
English: This study enhances generative diffusion models for discriminative tasks by addressing critical gaps in alignment, focusing on tailored learning objectives for different denoising steps and diffusion-specific data augmentation to improve perception accuracy and interactivity, achieving state-of-the-art results without architectural changes.

Authors:Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
Title: SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
Abstract:
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
中文:SimpleAR是一种简单的自回归框架,仅用5亿参数即可生成高保真的1024x1024图像,在多项基准测试中表现优异,并通过监督微调和GRPO训练显著提升生成质量,同时借助推理加速技术将生成时间缩短至约14秒。
English: SimpleAR is a straightforward autoregressive framework that generates high-fidelity 1024x1024 images with just 0.5B parameters, achieving competitive benchmark results and significant improvements through SFT and GRPO training, while reducing generation time to about 14 seconds with inference acceleration.

Authors:Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, Jakob Nicolaus Foerster
Title: A Clean Slate for Offline Reinforcement Learning
Abstract:
Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at https://github.com/EmptyJackson/unifloral.
中文摘要:本文针对离线强化学习中实现不一致和评估不公等挑战,提出了分类体系、严格的评估协议和简洁的单文件实现,并提出统一算法Unifloral,由此开发出超越现有基线的新算法TD3-AWR和MoBRAC。
English Summary: This paper addresses challenges in offline RL, such as inconsistent implementations and unfair evaluations, by introducing a taxonomy, a rigorous evaluation protocol, clean single-file implementations, and the unified algorithm Unifloral, which leads to two novel algorithms (TD3-AWR and MoBRAC) that outperform established baselines.

Authors:An Zhao, Shengyuan Zhang, Ling Yang, Zejian Li, Jiale Wu, Haoran Xu, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Title: Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion
Abstract:
The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion's slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct preference optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference alignment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable, so they cannot be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. To the best of our knowledge, our method is the first to explore adopting preference learning in distillation, and it provides insights into preference-aligned distillation. Our code is publicly available at https://github.com/happyw1nd/DistillationDPO.
Chinese: 本文提出Distillation-DPO,一种新颖的扩散蒸馏框架,通过直接偏好优化(DPO)进行偏好对齐来增强3D LiDAR场景补全,相比现有方法实现了5倍以上的速度提升和更高质量的结果。
English: This paper introduces Distillation-DPO, a novel diffusion distillation framework that enhances 3D LiDAR scene completion through preference alignment with direct preference optimization (DPO), achieving over 5-fold speed improvement and higher quality results compared to existing methods.
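The pair-construction step described in the abstract (rank two generated completions with a non-differentiable metric, then label them winner and loser) can be sketched as follows. Chamfer distance is used here only as an assumed example of a LiDAR scene metric, and the jittered point clouds stand in for completions sampled from different initial noises. The function names are hypothetical and the distillation loss itself is not shown.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Cloud = List[Point]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two small point clouds (brute force, O(n*m))."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def build_preference_pair(
    sample_a: Cloud,
    sample_b: Cloud,
    reference: Cloud,
    metric: Callable[[Sequence[Point], Sequence[Point]], float] = chamfer_distance,
) -> Tuple[Cloud, Cloud]:
    """Return (winner, loser); the sample with the lower metric value wins.

    The metric is only evaluated, never differentiated, which is why it can be
    used for ranking even though it cannot serve directly as a training loss.
    """
    if metric(sample_a, reference) <= metric(sample_b, reference):
        return sample_a, sample_b
    return sample_b, sample_a


def jitter(cloud: Cloud, sigma: float, rng: random.Random) -> Cloud:
    """Stand-in for a generated completion: the reference plus Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in cloud]


rng = random.Random(0)
reference = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]
completion_1 = jitter(reference, 0.01, rng)  # close to the reference scene
completion_2 = jitter(reference, 0.20, rng)  # much noisier completion
winner, loser = build_preference_pair(completion_1, completion_2, reference)
print("winner is completion_1:", winner is completion_1)
```

Pairs built this way would then feed a preference-based objective; the abstract states that Distillation-DPO uses the difference in score functions between teacher and student on such pairs.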

Authors:Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
Title: TextArena
Abstract:
TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
中文: TextArena是一个开源的文本竞技游戏集合,包含57+种独特环境,专门用于训练和评估大语言模型的动态社交能力(如谈判与欺骗),通过可扩展框架和实时评分系统弥补传统基准测试的不足。
English: TextArena is an open-source platform featuring over 57 competitive text-based games designed to train and evaluate social skills like negotiation and deception in LLMs, addressing gaps in traditional benchmarks through its extensible framework and real-time scoring system.

Authors:Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, Nantheera Anantrasirichai
Title: Mamba-Based Ensemble learning for White Blood Cell Classification
Abstract:
White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of modern technologies, such as Transformer-based models which do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at https://github.com/LewisClifton/Mamba-WBC-Classification.
中文:本文提出了一种结合Mamba模型与集成学习的新框架,通过线性复杂度模型提升白细胞分类效率,并发布了Chula-WBC-8基准数据集,为资源受限环境提供了可行的解决方案。
English: This paper introduces a novel framework using Mamba models with ensemble learning to improve white blood cell classification, offering a scalable and efficient alternative to Transformer-based methods while introducing a new benchmark dataset.

Authors:Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Abstract:
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
Chinese: 本文提出了一种双空间知识蒸馏(DSKD)框架,通过投影隐藏状态和对齐标记来统一师生模型的输出空间,实现了不同词汇表大语言模型间的有效知识迁移,并在多个基准测试中显著优于现有方法。
English: This paper introduces a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of teacher and student models by projecting hidden states and aligning tokens, enabling effective distillation between large language models with different vocabularies and outperforming existing methods.
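
The white-box KD objective that the abstract starts from, minimizing the distance between teacher and student output distributions, can be sketched as a temperature-scaled KL divergence. This is the baseline formulation, not DSKD itself; the function names and the temperature parameter are assumptions made for illustration.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]

def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's output distribution."""
    if len(teacher_logits) != len(student_logits):
        # This form requires both models to share one vocabulary.
        raise ValueError("teacher and student vocabularies differ in size")
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The length check makes limitation (b) concrete: a per-token KL of this kind is undefined when the two models use different vocabularies, which is the case DSKD's shared prediction head and token alignment are designed to handle.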

Authors:Panagiotis Agrafiotis, Begüm Demir
Title: Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps
Abstract:
Accurate, detailed, and high-frequency bathymetry is crucial for shallow seabed areas facing intense climatological and anthropogenic pressures. Current methods utilizing airborne or satellite optical imagery to derive bathymetry primarily rely on either SfM-MVS with refraction correction or Spectrally Derived Bathymetry (SDB). However, SDB methods often require extensive manual fieldwork or costly reference data, while SfM-MVS approaches face challenges even after refraction correction. These include depth data gaps and noise in environments with homogeneous visual textures, which hinder the creation of accurate and complete Digital Surface Models (DSMs) of the seabed. To address these challenges, this work introduces a methodology that combines the high-fidelity 3D reconstruction capabilities of the SfM-MVS methods with state-of-the-art refraction correction techniques, along with the spectral analysis capabilities of a new deep learning-based method for bathymetry prediction. This integration enables a synergistic approach where SfM-MVS derived DSMs with data gaps are used as training data to generate complete bathymetric maps. In this context, we propose Swin-BathyUNet that combines U-Net with Swin Transformer self-attention layers and a cross-attention mechanism, specifically tailored for SDB. Swin-BathyUNet is designed to improve bathymetric accuracy by capturing long-range spatial relationships and can also function as a standalone solution for standard SDB with various training depth data, independent of the SfM-MVS output. Experimental results at two completely different test sites in the Mediterranean and Baltic Seas demonstrate the effectiveness of the proposed approach, with improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM. The code is available at https://github.com/pagraf/Swin-BathyUNet.
中文: 本研究提出Swin-BathyUNet深度学习模型,通过结合SfM-MVS与光谱分析及折射校正技术,在地中海和波罗的海的实验中生成高精度、全覆盖的海底地形图,显著提升了测深精度并降低了噪声干扰。
English: This study introduces Swin-BathyUNet, a deep learning model that integrates SfM-MVS with spectral analysis and refraction correction to produce accurate, high-coverage seabed maps, demonstrating improved bathymetric precision and reduced noise in Mediterranean and Baltic Sea tests.

Authors:Liu Yang, Huiyu Duan, Yucheng Zhu, Xiaohong Liu, Lu Liu, Zitong Xu, Guangji Ma, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Title: Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model
Abstract:
$360^{\circ}$ omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360$^{\circ}$ Field-of-View (FoV) of ODIs. To bridge this gap, we construct \textbf{\textit{Any2Omni}}, the first comprehensive ODI generation-editing dataset, comprising 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an \textbf{\underline{Omni}} model for \textbf{\underline{Omni}}-directional image generation and editing (\textbf{\textit{Omni$^2$}}), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks. Both the Any2Omni dataset and the Omni$^2$ model are publicly available at: https://github.com/IntMeGroup/Omni2.
中文: 作者提出了首个全面的全向图像生成与编辑数据集Any2Omni,并开发了Omni²模型,该单一模型能有效处理多种全向图像任务,实验证明其具有卓越性能。
English: The authors introduce Any2Omni, a comprehensive dataset for omnidirectional image (ODI) generation and editing, and propose the Omni² model, which effectively handles multiple ODI tasks using a single framework, demonstrating superior performance in experiments.

Authors:Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Tao Tan, Vicente Grau, Jungong Han
Title: From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving by 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
中文摘要:本研究提出一种师生框架,通过整合临床医生的注视数据和视觉语言模型来增强医学图像分割效果,在多个数据集上实现Dice分数3-5%的提升,同时保持了临床可解释性。
English Summary: This study introduces a teacher-student framework that synergistically integrates clinician gaze data with vision-language models to enhance medical image segmentation, achieving 3-5% Dice score improvements across multiple datasets while preserving clinical interpretability.
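
The Dice score used to report these results measures overlap between a predicted mask and a reference mask. A minimal sketch for flat binary masks follows; the function name and the smoothing term `eps` are illustrative choices, not taken from the paper's code.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |P & T| / (|P| + |T|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    # eps keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)
```

For example, a prediction covering two pixels of which one matches a single-pixel target gives a Dice score of about 0.667.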

Authors:Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Vicente Grau, Jungong Han
Title: From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving by 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.
中文摘要:本研究提出一种师生框架,通过整合临床医生的注视数据和视觉语言模型来增强医学图像分割效果,在多个数据集上实现Dice分数3-5%的提升,同时保持了临床可解释性。
English Summary: This study introduces a teacher-student framework that synergistically integrates clinician gaze data with vision-language models to enhance medical image segmentation, achieving 3-5% Dice score improvements across multiple datasets while preserving clinical interpretability.

Authors:Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Title: Offline Learning and Forgetting for Reasoning with Large Language Models
Abstract:
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
Chinese: 该方法通过利用搜索生成的成功与失败路径对大型语言模型进行微调,在显著提升推理成功率的同时,大幅降低了推理时间,优于传统搜索方法。
English: The proposed method enhances reasoning in large language models by fine-tuning them with search-generated successful and failed paths, significantly improving success rates and reducing inference time compared to traditional search-based approaches.

Authors:Yuezhe Yang, Boyu Yang, Yaqian Wang, Yang He, Xingbo Dong, Zhe Jin
Title: Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A Systematic Review
Abstract:
The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project is available at: https://github.com/Bean-Young/AI4Radiology.
中文: 本综述探讨了放射影像中基于人工智能的先进三维重建算法,将其分为显式和隐式方法,并讨论了评估指标、当前挑战及未来方向,旨在提升临床诊断效果。
English: This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches, while also addressing evaluation metrics, challenges, and future directions to enhance clinical diagnosis.

Authors:Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Title: Autoregressive Distillation of Diffusion Transformers
Abstract:
Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1\% extra FLOPs on ImageNet-256. Moreover, ARD reaches FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: https://github.com/alsdudrla10/ARD.
Chinese: 提出的自回归蒸馏(ARD)方法通过利用历史ODE轨迹数据和改进的Transformer架构,有效解决了扩散模型中的曝光偏差问题,在极小计算开销下实现了更优的图像生成质量与效率。
English: The proposed AutoRegressive Distillation (ARD) method addresses exposure bias in diffusion models by utilizing historical ODE trajectory data with modified transformer architecture, achieving superior image generation quality and efficiency with minimal computational overhead.
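
The block-wise causal attention mask mentioned in the abstract can be illustrated with a small sketch. It assumes that each block holds the tokens of one trajectory step and that `True` means attention is allowed; the function name and this layout are assumptions, not details confirmed by the abstract.

```python
def block_causal_mask(num_blocks, block_size):
    """Boolean mask[q][k]: may query token q attend to key token k?

    Tokens attend freely within their own block and to all earlier
    blocks, but never to later blocks.
    """
    n = num_blocks * block_size
    return [
        [(k // block_size) <= (q // block_size) for k in range(n)]
        for q in range(n)
    ]
```

With two blocks of two tokens each, the first two rows allow only the first block, while the last two rows allow all four positions.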

Authors:Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
Abstract:
This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
中文: UniAnimate-DiT项目基于Wan2.1模型,通过LoRA微调和轻量级姿态编码器实现了高保真、时序一致的人像动画,并展现出从480p到720p的强大泛化能力。
English: UniAnimate-DiT utilizes the Wan2.1 model with LoRA fine-tuning and a lightweight pose encoder to achieve high-fidelity, temporally consistent human animations that generalize well from 480p to 720p resolution.
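
The LoRA idea of training only a small low-rank update on top of frozen weights can be sketched in plain Python. The helper names, the tiny matrices, and the `alpha / r` scaling follow the common LoRA formulation and are illustrative; they are not taken from the UniAnimate-DiT code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight (d_out x d_in); b: d_out x r; a: r x d_in.
    Only a and b would be trained; w stays fixed.
    """
    r = len(a)  # rank of the update
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]
```

With a `d_out x d_in` weight and rank `r`, the trainable parameter count is `r * (d_out + d_in)` instead of `d_out * d_in`, which is where the reduction in training memory comes from.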

Authors:Xinning Chai, Yao Zhang, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song
Title: Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution
Abstract:
Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at https://github.com/Yaozzz666/DSCF-SR.
Chinese: 提出的DSCLoRA方法通过结合低秩适应和知识蒸馏,提升了轻量级超分辨率模型的性能,在不增加计算成本的情况下获得了更高的PSNR和SSIM分数,并在NTIRE 2025挑战赛中荣获第一名。
English: The proposed DSCLoRA method enhances the performance of lightweight super-resolution models by integrating low-rank adaptation and knowledge distillation, achieving superior PSNR and SSIM scores without increasing computational costs, as evidenced by its first-place ranking in the NTIRE 2025 challenge.

Authors:Hannes Petrenz, Johannes Köhler, Francesco Borrelli
Title: Robust MPC for Uncertain Linear Systems -- Combining Model Adaptation and Iterative Learning
Abstract:
This paper presents a robust adaptive learning Model Predictive Control (MPC) framework for linear systems with parametric uncertainties and additive disturbances performing iterative tasks. The approach refines the parameter estimates online using set-membership estimation. Performance enhancement over iterations is achieved by learning the terminal cost from data. Safety is enforced using a terminal set, which is also learned iteratively. The proposed method guarantees recursive feasibility, constraint satisfaction, and a robust bound on the closed-loop cost. Numerical simulations on a mass-spring-damper system demonstrate improved computational efficiency and control performance compared to a robust adaptive MPC scheme without iterative learning of the terminal ingredients.
中文: 本文提出了一种鲁棒自适应学习模型预测控制框架,通过在线优化参数估计并迭代学习终端要素,有效提升了存在不确定性的线性系统的控制性能和计算效率。
English: This paper introduces a robust adaptive learning MPC framework that enhances control performance and computational efficiency for linear systems under uncertainties by iteratively refining parameter estimates and learning terminal components from data.
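
Set-membership estimation, which the abstract uses to refine parameter estimates online, can be illustrated on a scalar toy model. The sketch assumes a measurement model `y = theta * x + w` with a known disturbance bound `|w| <= w_bar`; the model, names, and numbers are hypothetical and much simpler than the paper's setting.

```python
def set_membership_update(theta_lo, theta_hi, x, y, w_bar):
    """Shrink the interval [theta_lo, theta_hi] using one measurement.

    Every theta consistent with |y - theta * x| <= w_bar lies between
    (y - w_bar) / x and (y + w_bar) / x; intersect with the prior set.
    """
    if x == 0:
        return theta_lo, theta_hi  # measurement carries no information
    lo, hi = sorted(((y - w_bar) / x, (y + w_bar) / x))
    new_lo, new_hi = max(theta_lo, lo), min(theta_hi, hi)
    if new_lo > new_hi:
        raise ValueError("measurement inconsistent with assumed bounds")
    return new_lo, new_hi
```

Because each update is an intersection, the parameter set can only shrink or stay the same, so constraints tightened against it remain valid as more data arrives.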

Authors:Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Title: R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
Abstract:
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
中文: 针对视觉语言模型易受对抗攻击的问题,本文提出R-TPT方法,通过优化熵目标和加权集成策略,在无需标注数据的情况下实现推理阶段的灵活防御。
English: Vision-language models face heightened adversarial risks, so this paper introduces R-TPT, a test-time prompt tuning method that strengthens defenses without labeled data by optimizing entropy and employing weighted ensembling.
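
A rough sketch of the two ingredients named in the abstract, per-view (pointwise) entropy and reliability-weighted ensembling over augmented views, is given below. The abstract does not say how reliability is measured, so treating low prediction entropy as high reliability and weighting views by a softmax over negative entropies is purely an assumption, as are all names.

```python
import math

def entropy(probs):
    """Shannon entropy of one view's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reliability_weighted_ensemble(view_probs, temperature=1.0):
    """Fuse per-view class probabilities, favouring confident views.

    Assumed weighting: softmax over negative entropies, so lower
    entropy means a larger weight. Returns (fused_probs, weights).
    """
    ents = [entropy(p) for p in view_probs]
    base = min(ents)  # shift for numerical stability
    raw = [math.exp(-(e - base) / temperature) for e in ents]
    total = sum(raw)
    weights = [r / total for r in raw]
    num_classes = len(view_probs[0])
    fused = [sum(w * p[c] for w, p in zip(weights, view_probs))
             for c in range(num_classes)]
    return fused, weights
```

Under this weighting, a confident view such as `[0.9, 0.1]` contributes more to the fused prediction than an uninformative view such as `[0.5, 0.5]`.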

Authors:Taewook Kang, Bum-Jae You, Juyoun Park, Yisoo Lee
Title: A real-time anomaly detection method for robots based on a flexible and sparse latent space
Abstract:
The growing demand for robots to operate effectively in diverse environments necessitates robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present the Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model to address these problems. This approach integrates the Masked Autoregressive Flow model into Adversarial AutoEncoders to construct a flexible latent space and utilizes a Sparse autoencoder to efficiently focus on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inferences within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code is available at https://github.com/twkang43/sparse-maf-aae.
中文: 本文提出了一种基于稀疏掩码自回归流的对抗自编码器模型,通过优化特征聚焦和潜空间灵活性,在机器人抓放任务和碰撞场景中显著提升了实时异常检测性能,同时实现了毫秒级推理速度。
English: This paper introduces a Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model that enhances real-time anomaly detection in robotics by improving feature focus and latent space flexibility, achieving superior performance in pick-and-place tasks and collision scenarios while ensuring millisecond-level inference speeds.

Authors:P. Tomkiewicz, J. Jaworski, P. Zielonka, A. Wilinski
Title: K-means Enhanced Density Gradient Analysis for Urban and Transport Metrics Using Multi-Modal Satellite Imagery
Abstract:
This paper presents a novel computational approach for evaluating urban metrics through density gradient analysis using multi-modal satellite imagery, with applications including public transport and other urban systems. By combining optical and Synthetic Aperture Radar (SAR) data, we develop a method to segment urban areas, identify urban centers, and quantify density gradients. Our approach calculates two key metrics: the density gradient coefficient ($α$) and the minimum effective distance (LD) at which density reaches a target threshold. We further employ machine learning techniques, specifically K-means clustering, to objectively identify uniform and high-variability regions within density gradient plots. We demonstrate that these metrics provide an effective screening tool for public transport analyses by revealing the underlying urban structure. Through comparative analysis of two representative cities with contrasting urban morphologies (monocentric vs polycentric), we establish relationships between density gradient characteristics and public transport network topologies. Cities with clear density peaks in their gradient plots indicate distinct urban centers requiring different transport strategies than those with more uniform density distributions. This methodology offers urban planners a cost-effective, globally applicable approach to preliminary public transport assessment using freely available satellite data. The complete implementation, with additional examples and documentation, is available in an open-source repository under the MIT license at https://github.com/nexri/Satellite-Imagery-Urban-Analysis.
中文: 本文提出了一种利用多模态卫星影像和机器学习分析城市密度梯度的新计算方法,为基于城市结构的公共交通系统评估提供了一种经济有效的工具。
English: This paper introduces a novel computational method using multi-modal satellite imagery and machine learning to analyze urban density gradients, providing a cost-effective tool for evaluating public transport systems based on urban structure.
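
The two metrics in the abstract, the density gradient coefficient and the minimum effective distance LD, can be illustrated under an assumed functional form. The sketch below supposes that density decays exponentially with distance from the urban center, `density(r) = d0 * exp(-alpha * r)`; the abstract does not state this form, so the model, the function names, and the sample numbers are all assumptions.

```python
import math

def fit_density_gradient(distances, densities):
    """Least-squares fit of log(density) = log(d0) - alpha * r.

    Returns (d0, alpha) for the assumed exponential decay model.
    Densities must be strictly positive.
    """
    xs = list(distances)
    ys = [math.log(d) for d in densities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope

def effective_distance(d0, alpha, threshold):
    """Distance at which the fitted density falls to `threshold`."""
    return math.log(d0 / threshold) / alpha
```

For a fitted `d0 = 100` and `alpha = 0.5`, the density reaches a threshold of 10 at a distance of `ln(10) / 0.5`, roughly 4.6 in whatever distance unit the profile uses.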

Authors:Elman Ghazaei, Erchan Aptoula
Title: Change State Space Models for Remote Sensing Change Detection
Abstract:
Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations: the former struggle to model long-range dependencies, while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models, has emerged as an alternative addressing the aforementioned deficiencies and has already been applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article, the Change State Space Model is introduced, which has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, significantly enhancing computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
Chinese: 变化状态空间模型(CSSM)是一种专为遥感变化检测设计的新型架构,通过聚焦双时相图像间的相关变化来减少参数和计算复杂度,在基准数据集上以更低计算成本超越了卷积网络、视觉Transformer和基于Mamba的模型。
English: The Change State Space Model (CSSM) is a novel architecture designed specifically for remote sensing change detection, which efficiently focuses on relevant changes between bi-temporal images to reduce parameters and computational complexity while outperforming ConvNets, ViTs, and Mamba-based models on benchmark datasets.

Authors:Dongmin Kim, Hoshinori Kanazawa, Naoto Yoshida, Yasuo Kuniyoshi
Title: Emergence of Goal-Directed Behaviors via Active Inference with Self-Prior
Abstract:
Infants often exhibit goal-directed behaviors, such as reaching for a sensory stimulus, even when no external reward criterion is provided. These intrinsically motivated behaviors facilitate spontaneous exploration and learning of the body and environment during early developmental stages. Although computational modeling can offer insight into the mechanisms underlying such behaviors, many existing studies on intrinsic motivation focus primarily on how exploration contributes to acquiring external rewards. In this paper, we propose a novel density model for an agent's own multimodal sensory experiences, called the "self-prior," and investigate whether it can autonomously induce goal-directed behavior. Integrated within an active inference framework based on the free energy principle, the self-prior generates behavioral references purely from an intrinsic process that minimizes mismatches between average past sensory experiences and current observations. This mechanism is also analogous to the acquisition and utilization of a body schema through continuous interaction with the environment. We examine this approach in a simulated environment and confirm that the agent spontaneously reaches toward a tactile stimulus. Our study implements intrinsically motivated behavior shaped by the agent's own sensory experiences, demonstrating the spontaneous emergence of intentional behavior during early development.
中文摘要:本文提出一种"自我先验"模型,通过最小化历史与当前感官体验差异的内在驱动机制,使智能体能够自主产生目标导向行为,并在仿真中实现了自发的触觉伸手动作。
English Summary: This paper introduces a "self-prior" model that enables autonomous agents to generate goal-directed behaviors through intrinsic motivation by minimizing discrepancies between past and current sensory experiences, demonstrating spontaneous tactile reaching in simulations.

Authors:Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou
Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection
Abstract:
Anomaly detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible. Recently, the rich pretraining knowledge of CLIP has shown promising zero-shot generalization in detecting anomalies without the need for training samples from target domains. However, CLIP's coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies due to: (1) spatial misalignment, and (2) the limited sensitivity of global features to local anomalous patterns. In this paper, we propose Crane, which tackles both problems. First, we introduce a correlation-based attention module to retain spatial alignment more accurately. Second, to boost the model's awareness of fine-grained anomalies, we condition the learnable prompts of the text encoder on image context extracted from the vision encoder and perform a local-to-global representation fusion. Moreover, our method can incorporate vision foundation models such as DINOv2 to further enhance spatial understanding and localization. The key insight of Crane is to balance learnable adaptations for modeling anomalous concepts with non-learnable adaptations that preserve and exploit generalized pretrained knowledge, thereby minimizing in-domain overfitting and maximizing performance on unseen domains. Extensive evaluation across 14 diverse industrial and medical datasets demonstrates that Crane consistently improves on the state of the art in zero-shot anomaly detection (ZSAD) by 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed. The code is available at https://github.com/AlirezaSalehy/Crane.
中文: 本文提出Crane方法,通过相关性注意力机制和图像上下文驱动的可学习提示,解决CLIP在零样本异常检测中的空间错位问题,在保持推理速度的同时,在14个工业与医疗数据集上实现了2%至28%的性能提升。
English: This paper introduces Crane, a method that enhances zero-shot anomaly detection by addressing CLIP's spatial misalignment and insensitivity to local anomalies through correlation-based attention and context-aware prompts, achieving state-of-the-art performance across diverse datasets without compromising inference speed.

Authors:Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Title: LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Abstract:
Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: https://github.com/UKPLab/acl2025-lazy-review)
中文: 本研究推出了LazyReview数据集用于检测同行评审中的惰性思维,证明经过微调的大语言模型能显著提升检测效果,且基于此类分析的反馈能有效提高评审质量。
English: This study introduces LazyReview, a dataset for detecting lazy thinking in peer reviews, showing that fine-tuned LLMs significantly improve detection and that feedback based on such analysis enhances review quality.

Authors:Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang
Title: QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models
Abstract:
In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at https://github.com/btzyd/qava.
中文: 本文提出查询无关视觉攻击(QAVA)方法,通过生成鲁棒性对抗图像,使大型视觉语言模型在面对未知问题时仍产生错误回答,在目标问题不明确的情况下显著提升了攻击效果与效率。
English: This paper introduces Query-Agnostic Visual Attack (QAVA), a method that generates robust adversarial images capable of misleading large vision-language models into giving incorrect answers to unspecified questions, significantly enhancing attack effectiveness and efficiency when the target question is unknown.

Authors:Alexandru Vasilache, Jona Scholz, Vincent Schilling, Sven Nitzsche, Florian Kaelber, Johannes Korsch, Juergen Becker
Title: A PyTorch-Compatible Spike Encoding Framework for Energy-Efficient Neuromorphic Applications
Abstract:
Spiking Neural Networks (SNNs) offer promising energy efficiency advantages, particularly when processing sparse spike trains. However, their incompatibility with traditional datasets, which consist of batches of input vectors rather than spike trains, necessitates the development of efficient encoding methods. This paper introduces a novel, open-source PyTorch-compatible Python framework for spike encoding, designed for neuromorphic applications in machine learning and reinforcement learning. The framework supports a range of encoding algorithms, including Leaky Integrate-and-Fire (LIF), Step Forward (SF), Pulse Width Modulation (PWM), and Ben's Spiker Algorithm (BSA), as well as specialized encoding strategies covering population coding and reinforcement learning scenarios. Furthermore, we investigate the performance trade-offs of each method on embedded hardware using C/C++ implementations, considering energy consumption, computation time, spike sparsity, and reconstruction accuracy. Our findings indicate that SF typically achieves the lowest reconstruction error and offers the highest energy efficiency and fastest encoding speed, achieving the second-best spike sparsity. At the same time, other methods demonstrate particular strengths depending on the signal characteristics. This framework and the accompanying empirical analysis provide valuable resources for selecting optimal encoding strategies for energy-efficient SNN applications.
中文: 本文提出了一种新型开源脉冲编码框架,支持多种编码算法,实证研究表明步进前向编码在能量效率和重构精度方面表现最优,而其他方法在不同信号特性下各具优势。
English: This paper presents a novel open-source PyTorch-compatible spike encoding framework supporting multiple algorithms, with empirical analysis showing that Step Forward encoding typically achieves the lowest reconstruction error, the highest energy efficiency, and the fastest encoding speed, while other methods excel for specific signal characteristics.
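
The Step Forward (SF) scheme named in the abstract can be illustrated with a minimal pure-Python sketch. This follows the commonly described form of SF encoding (a moving baseline that emits a +1 or -1 spike whenever the signal leaves a threshold band) and is not the framework's own API; the function names `step_forward_encode` and `step_forward_decode` and the `threshold` parameter are assumptions made for illustration.

```python
from typing import List


def step_forward_encode(signal: List[float], threshold: float) -> List[int]:
    """Encode a sampled signal as a train of +1 / -1 / 0 spikes.

    Hypothetical helper: a moving baseline starts at the first sample and is
    shifted by `threshold` each time a spike is emitted.
    """
    if not signal:
        return []
    baseline = signal[0]
    spikes = [0]  # the first sample only initialises the baseline
    for value in signal[1:]:
        if value > baseline + threshold:
            spikes.append(1)       # signal rose above the band
            baseline += threshold
        elif value < baseline - threshold:
            spikes.append(-1)      # signal fell below the band
            baseline -= threshold
        else:
            spikes.append(0)       # signal stayed inside the band
    return spikes


def step_forward_decode(spikes: List[int], start: float, threshold: float) -> List[float]:
    """Reconstruct an approximation by accumulating threshold-sized steps."""
    value = start
    reconstructed = []
    for spike in spikes:
        value += spike * threshold
        reconstructed.append(value)
    return reconstructed


signal = [0.0, 0.3, 0.6, 0.5, 0.1]
spikes = step_forward_encode(signal, threshold=0.25)
approx = step_forward_decode(spikes, start=signal[0], threshold=0.25)
```

Comparing `approx` with `signal` gives a reconstruction error of the kind the abstract reports, and the fraction of zeros in `spikes` gives a simple sparsity measure.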

Authors:Hyejin Lee, Seokjun Hong, Jeonghoon Song, Haechan Cho, Zhixiong Jin, Byeonghun Kim, Joobin Jin, Jaegyun Im, Byeongjoon Noh, Hwasoo Yeo
Title: DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environment
Abstract:
Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales - from individual vehicle maneuvers like lane-changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at https://github.com/AIxMobility/The-DRIFT.
中文: DRIFT数据集通过同步无人机视频提供大规模高分辨率车辆轨迹数据,支持从微观车辆行为到宏观路网流量的多尺度交通分析,无需额外预处理即可直接应用。
English: The DRIFT dataset offers a large-scale collection of high-resolution vehicle trajectories from synchronized drone videos, enabling multi-scale traffic analysis from individual maneuvers to network dynamics without preprocessing.
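
As a rough illustration of the time-to-collision (TTC) safety metric mentioned in the abstract, the sketch below uses the textbook car-following definition: the bumper-to-bumper gap divided by the closing speed. It does not use the DRIFT tooling; the function name and its parameters are hypothetical, and the dataset's own analytical tools may define TTC differently.

```python
import math


def time_to_collision(gap_m: float, follower_speed_mps: float, leader_speed_mps: float) -> float:
    """Return TTC in seconds for a follower approaching a leader in the same lane.

    Assumption: `gap_m` is the bumper-to-bumper distance in metres and both
    speeds are in metres per second. If the follower is not closing in on the
    leader, no collision is predicted and the result is infinity.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


ttc = time_to_collision(gap_m=20.0, follower_speed_mps=15.0, leader_speed_mps=10.0)
```

With a 20 m gap and a 5 m/s closing speed, `ttc` is 4.0 seconds; lower values indicate a more critical interaction.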

Authors:Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
Title: Dynamic Compressing Prompts for Efficient Inference of Large Language Models
Abstract:
Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.
中文: 本文提出动态压缩提示(LLM-DCP)方法,通过将提示压缩建模为马尔可夫决策过程,结合奖励函数和分层训练策略,在保持性能的同时逐步去除冗余标记,尤其在较高压缩率下优于现有技术。
English: This paper introduces Dynamic Compressing Prompts (LLM-DCP), a task-agnostic method that models prompt compression as a Markov Decision Process to sequentially remove redundant tokens while preserving performance through a reward function and hierarchical training strategy, outperforming existing techniques especially at high compression rates.

Authors:Bo-Cheng Hu, Ge-Peng Ji, Dian Shao, Deng-Ping Fan
Title: PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation
Abstract:
Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: https://github.com/ai4colonoscopy/PraNet-V2/tree/main/binary_seg/jittor.
Chinese: PraNet-V2通过引入双重监督反向注意力模块,解决了PraNet-V1在多类别分割任务中的不足,在多个息肉分割数据集上表现出色,并显著提升了语义分割模型的平均Dice分数。
English: PraNet-V2 introduces a Dual-Supervised Reverse Attention module to overcome PraNet-V1's limitations in multi-class segmentation, achieving enhanced performance across polyp segmentation datasets and improving mean Dice scores in semantic segmentation models.
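
For readers unfamiliar with the metric behind the reported 1.36% gain, the following sketch computes the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flattened binary masks and averages it over several samples. It is independent of the PraNet-V2 code base; the function names are hypothetical, and returning 1.0 for two empty masks is a convention assumed here.

```python
from typing import List, Sequence, Tuple


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def mean_dice(pairs: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average the Dice coefficient over (prediction, ground truth) pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),  # one overlapping pixel out of two each
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # perfect overlap
]
average = mean_dice(pairs)
```

For multi-class segmentation, the same function can be applied per class on one-vs-rest masks and then averaged, which is one common way a mean Dice score is formed.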

Authors:Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji
Title: An Efficient and Mixed Heterogeneous Model for Image Restoration
Abstract:
Image restoration (IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mambas excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at https://github.com/ClimBin/RestorMixer.
中文: RestorMixer是一种新颖的混合架构融合模型,有效结合了CNN、Transformer和Mamba的优势,能够在保持高效推理的同时解决多种图像复原任务。
English: RestorMixer is a novel mixed-architecture fusion model that effectively integrates CNNs, Transformers, and Mambas to address diverse image restoration challenges while maintaining high inference efficiency.

Authors:Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo, Xun Yang, Meng Wang
Title: Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering
Abstract:
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (in #parameters) while scoring 6.0 points higher (in SumR) than the up-to-date method GMMFormer on TVR.
English Summary: AMDNet introduces an efficient approach for partially relevant video retrieval by actively discovering semantically consistent moments using learnable span anchors and masked attention, achieving superior performance with fewer parameters.

Authors:Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
Title: Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Abstract:
The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval.
中文: 本研究评估了40多个大语言模型在12种语言中的表现,发现经过后训练的小型开源模型在跨语言上下文检索能力上可媲美GPT-4o,其性能依赖于预训练阶段形成的分层处理机制,且需要通过多语言后训练而非扩大预训练规模来充分释放潜力。
English: This study evaluates over 40 large language models across 12 languages, revealing that several small post-trained open models approach the cross-lingual context retrieval ability of closed-source models such as GPT-4o. The retrieval process splits into question encoding, formed in pre-training, and answer retrieval, formed in post-training, so larger models need further multilingual post-training rather than larger-scale pretraining to unlock their potential.

Authors:Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang
Title: Efficient Reasoning Models: A Survey
Abstract:
Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference of reasoning models. A curated collection of papers discussed in this survey is available in our GitHub repository: https://github.com/fscdc/Awesome-Efficient-Reasoning-Models.
中文摘要:推理模型通过长思维链实现高精度但带来巨大计算开销,本综述将加速方法归纳为缩短推理链、缩小模型规模及优化解码策略三大方向。
English Summary: Reasoning models achieve high accuracy through extended Chain-of-Thoughts but incur significant computational costs, prompting this survey to categorize acceleration methods into shorter reasoning chains, smaller models, and faster decoding strategies.

Authors:Yuwen Liao, Xinhang Xu, Ruofei Bai, Yizhuo Yang, Muqing Cao, Shenghai Yuan, Lihua Xie
Title: Following Is All You Need: Robot Crowd Navigation Using People As Planners
Abstract:
Navigating in crowded environments requires the robot to be equipped with high-level reasoning and planning techniques. Existing works focus on developing complex and heavyweight planners while ignoring the role of human intelligence. Since humans are highly capable agents who are also widely available in a crowd navigation setting, we propose an alternative scheme where the robot utilises people as planners to benefit from their effective planning decisions and social behaviours. Through a set of rule-based evaluations, we identify suitable human leaders who exhibit the potential to guide the robot towards its goal. Using a simple base planner, the robot follows the selected leader through short-horizon subgoals that are designed to be straightforward to achieve. We demonstrate through both simulated and real-world experiments that our novel framework generates safe and efficient robot plans compared to existing planners, even without predictive or data-driven modules. Our method also brings human-like robot behaviours without explicitly defining traffic rules and social norms. Code will be available at https://github.com/centiLinda/PeopleAsPlanner.git.
中文: 本文提出了一种新颖的机器人导航框架,在拥挤环境中利用人类领航者进行高效安全的路径规划,无需复杂预测模型即可实现类人行为。
English: This paper introduces a novel robot navigation framework that leverages human leaders in crowded settings for efficient and safe path planning, eliminating the need for complex predictive models while achieving human-like behavior.

Authors:Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon
Title: TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos
Abstract:
Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity over baseline methods (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems dependent on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (will be made available upon acceptance of this paper).
中文: 提出的TFMPathy系统利用表格基础模型,在解决隐私限制的同时,显著提升了基于视频衍生数据的共情检测准确率和跨被试泛化能力。
English: The proposed TFMPathy system leverages tabular foundation models to significantly improve empathy detection accuracy and cross-subject generalization from video-derived data while addressing privacy constraints.

Authors:Jessica Lin, Amir Zeldes
Title: GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction
Abstract:
Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at https://github.com/jl908069/gum_sum_salience to support further research on graded salient entity extraction.
中文摘要:本文提出了一种新颖的实体显著性分级方法,通过结合主观评分和基于摘要的方法,利用多篇摘要计算实体重要性,在多种文本类型中展现出优于现有技术的性能表现。
English Summary: This paper introduces a novel graded entity salience method that combines subjective scoring and summarization-based approaches, achieving superior performance over existing techniques by calculating entity importance through multiple summaries across diverse genres.
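
The scoring idea in the abstract (an entity's graded salience reflects how many of the 5 summaries mention it) can be sketched as below. This is a simplified illustration, not the authors' pipeline: it assumes case-insensitive substring matching against hand-listed aliases, whereas the paper relies on summary alignments, and the function name and sample data are hypothetical.

```python
from typing import Dict, List


def graded_salience(entity_aliases: Dict[str, List[str]], summaries: List[str]) -> Dict[str, int]:
    """Score each entity by the number of summaries that mention it.

    Assumption: a mention is detected by case-insensitive substring matching
    of any alias, which is a crude stand-in for real mention alignment.
    """
    lowered = [summary.lower() for summary in summaries]
    scores = {}
    for entity, aliases in entity_aliases.items():
        scores[entity] = sum(
            1 for summary in lowered
            if any(alias.lower() in summary for alias in aliases)
        )
    return scores


# Invented example document with five summaries.
summaries = [
    "Marie Curie won the Nobel Prize.",
    "Curie studied radioactivity in Paris.",
    "A scientist won a prize.",
    "Marie Curie discovered polonium.",
    "The Nobel Prize went to Curie.",
]
entities = {
    "Marie Curie": ["Marie Curie", "Curie"],
    "Nobel Prize": ["Nobel Prize"],
    "Paris": ["Paris"],
}
scores = graded_salience(entities, summaries)
```

With 5 summaries per document, each score falls between 0 and 5, which yields a graded rather than binary notion of salience.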

Authors:Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri
Title: Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables
Abstract:
Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first manually specify data-quality constraints specific to a given table, before data cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any tables, without requiring domain-experts to manually specify on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code are available at https://github.com/qixuchen/AutoTest to facilitate future research.
Chinese: 本研究提出了一种名为语义域约束的新型数据质量约束,无需领域专家手动指定即可自动推断并应用于任何表格,既能直接检测错误,又能增强现有数据清洗方法,并具有可证明的质量保证。
English: This study introduces Semantic-Domain Constraints, a novel class of data-quality constraints that can be automatically inferred and applied to any table without manual domain-expert input, enabling both direct error detection and enhancement of existing data-cleaning methods with provable quality guarantees.

Authors:Yueqian Lin, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Hai "Helen" Li, Yiran Chen
Title: HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
Abstract:
Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically-inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results demonstrate that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
Chinese: HippoMM是一种受生物学启发的架构,将海马体机制转化为多模态理解的计算优势,在HippoVlog基准测试中显著超越了现有最优方法的准确率和响应时间。
English: HippoMM is a biologically-inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding, significantly outperforming state-of-the-art approaches in accuracy and response time on the HippoVlog benchmark.

Authors:Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
Title: The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Abstract:
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
Chinese: 本文提出"越狱代价"概念,指出尽管越狱攻击能突破AI模型防护,但会导致模型性能显著下降——在多项基准测试中准确率降幅高达92%,并建立了评估越狱攻击效用的新基准。
English: This paper introduces the concept of "jailbreak tax," showing that while jailbreak attacks can bypass AI model safeguards, they significantly reduce the model's utility, as demonstrated by up to a 92% drop in accuracy across various benchmarks.
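
The "up to 92%" figure above is a drop in accuracy on tasks with known ground truth. One natural way to quantify such a drop is as a relative loss against the accuracy of the model before it was aligned to refuse. The sketch below assumes that reading; the function name and the normalization are illustrative assumptions, not taken from the abstract.

```python
def jailbreak_tax(base_accuracy: float, jailbroken_accuracy: float) -> float:
    """Relative accuracy lost by jailbroken answers versus a baseline.

    Assumption: the tax is (base - jailbroken) / base, with both
    accuracies given as fractions in [0, 1]. The benchmark's exact
    definition may differ.
    """
    if not 0.0 < base_accuracy <= 1.0:
        raise ValueError("base_accuracy must be in (0, 1]")
    if not 0.0 <= jailbroken_accuracy <= 1.0:
        raise ValueError("jailbroken_accuracy must be in [0, 1]")
    return (base_accuracy - jailbroken_accuracy) / base_accuracy


# Hypothetical numbers: accuracy falls from 50% to 25% after the jailbreak.
tax = jailbreak_tax(0.5, 0.25)
```

Under this definition a tax of 0 means the jailbreak preserved the baseline utility, and a tax of 1 means the jailbroken answers were never correct.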

Authors:Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen
Title: Relation-Rich Visual Document Generator for Visual Information Extraction
Abstract:
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results demonstrate that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE .
中文摘要:提出的RIDGE模型通过基于大语言模型的内容生成和从OCR结果驱动布局生成的两阶段方法,克服了视觉文档理解中布局多样性和数据稀缺的挑战,显著提升了各类VIE基准测试的性能。
English Summary: The proposed RIDGE model overcomes limitations in visual document understanding by generating relation-rich documents through content creation using LLMs and content-driven layout generation from OCR data, significantly improving performance on VIE benchmarks.

Authors:Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao
Title: Improving In-Context Learning with Reasoning Distillation
Abstract:
Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at https://github.com/NafisSadeq/reasoning-distillation.git.
中文: ReDis是一种推理蒸馏技术,通过数据增强和微调提升语言模型的归纳推理能力,在多项任务中表现卓越,部分情况下甚至超越了GPT-4o。
English: ReDis is a reasoning distillation technique that enhances language models' inductive reasoning through data augmentation and fine-tuning, achieving superior performance across multiple tasks and even surpassing GPT-4o in some cases.

Authors:Laura S. Herzog, Lucas Berent, Aleksander Kubica, Robert Wille
Title: Lattice Surgery Compilation Beyond the Surface Code
Abstract:
Large-scale fault-tolerant quantum computation requires compiling logical circuits into physical operations tailored to a given architecture. Prior work addressing this challenge has mostly focused on the surface code and lattice surgery schemes. In this work, we broaden the scope by considering lattice surgery compilation for topological codes beyond the surface code. We begin by defining a code substrate - a blueprint for implementing topological codes and lattice surgery. We then abstract from the microscopic details and rephrase the compilation task as a mapping and routing problem on a macroscopic routing graph, potentially subject to substrate-specific constraints. We explore specific substrates and codes, including the color code and the folded surface code, providing detailed microscopic constructions. For the color code, we present numerical simulations analyzing how design choices at the microscopic and macroscopic levels affect the depth of compiled logical $\mathrm{CNOT}+\mathrm{T}$ circuits. The open-source code is available on GitHub at https://github.com/cda-tum/mqt-qecc.
中文摘要:本研究通过引入代码基底框架将晶格手术编译扩展到表面代码之外的拓扑代码,将编译任务转化为映射与路由问题,并对颜色代码和折叠表面代码进行了详细实现与性能分析。
English Summary: This research expands lattice surgery compilation to topological codes beyond the surface code by introducing a code substrate framework that transforms compilation into a mapping and routing problem, with detailed implementations and performance analysis for color codes and folded surface codes.

Authors:Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Title: LEMUR Neural Network Dataset: Towards Seamless AutoML
Abstract:
Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr
中文: LEMUR 是一个开源数据集与框架,通过统一模板标准化 PyTorch 神经网络模型,集成自动化优化与分析工具,旨在加速 AutoML 研究并保障实验可复现性。
English: LEMUR is an open-source dataset and framework that standardizes PyTorch neural networks across various tasks, integrating automated optimization and analysis tools to accelerate AutoML research and ensure reproducibility.

Authors:Mingyang Zhu, Yinting Liu, Mingyu Li, Jiacheng Wang
Title: PathSeqSAM: Sequential Modeling for Pathology Image Segmentation with SAM2
Abstract:
Current methods for pathology image segmentation typically treat 2D slices independently, ignoring valuable cross-slice information. We present PathSeqSAM, a novel approach that treats 2D pathology slices as sequential video frames using SAM2's memory mechanisms. Our method introduces a distance-aware attention mechanism that accounts for variable physical distances between slices and employs LoRA for domain adaptation. Evaluated on the KPI Challenge 2024 dataset for glomeruli segmentation, PathSeqSAM demonstrates improved segmentation quality, particularly in challenging cases that benefit from cross-slice context. We have publicly released our code at https://github.com/JackyyyWang/PathSeqSAM.
中文: PathSeqSAM提出了一种创新方法,将二维病理切片视为连续视频帧,利用SAM2的记忆机制结合距离感知注意力机制和LoRA进行领域适配,在KPI挑战赛2024数据集上实现了更优的分割效果。
English: PathSeqSAM introduces a novel method that treats 2D pathology slices as sequential video frames, utilizing SAM2's memory mechanisms with a distance-aware attention mechanism and LoRA for domain adaptation, achieving improved segmentation quality on the KPI Challenge 2024 dataset.

Authors:Zhe Wang, Fangtian Fu, Wei Zhang, Lige Yan, Yan Meng, Jianping Wu, Hui Wu, Gang Xu, Si Chen
Title: BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications
Abstract:
Automated extraction of chemical structures and their bioactivity data is crucial for accelerating drug discovery and enabling data-driven pharmaceutical research. Existing optical chemical structure recognition (OCSR) tools fail to autonomously associate molecular structures with their bioactivity profiles, creating a critical bottleneck in structure-activity relationship (SAR) analysis. Here, we present BioChemInsight, an open-source pipeline that integrates: (1) DECIMER Segmentation and MolVec for chemical structure recognition, (2) Qwen2.5-VL-32B for compound identifier association, and (3) PaddleOCR with Gemini-2.0-flash for bioactivity extraction and unit normalization. We evaluated the performance of BioChemInsight on 25 patents and 17 articles. BioChemInsight achieved 95% accuracy for tabular patent data (structure/identifier recognition), with lower accuracy in non-tabular patents (~80% structures, ~75% identifiers), plus 92.2% bioactivity extraction accuracy. For articles, it attained >99% identifier accuracy and 78-80% structure accuracy in non-tabular formats, plus 97.4% bioactivity extraction accuracy. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours while enabling applications in high-throughput screening and ML-driven drug design (https://github.com/dahuilangda/BioChemInsight).
中文:BioChemInsight是一个开源流程,能够自动从专利和文献中提取化学结构及其生物活性数据,准确率高,通过生成可直接使用的构效关系数据集,大幅加速药物发现进程。
English: BioChemInsight is an open-source pipeline that automates the extraction of chemical structures and their bioactivity data from patents and articles, achieving high accuracy and significantly accelerating drug discovery by generating ready-to-use SAR datasets.

Authors:Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Abstract:
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.
中文摘要:ColorBench是一个评估视觉语言模型颜色理解能力的新基准,揭示了现有模型在颜色感知、推理和鲁棒性方面存在显著不足,尽管模型规模扩大和思维链推理能带来一定提升。
English Summary: ColorBench is a novel benchmark that evaluates vision-language models' color understanding, revealing significant limitations in their ability to perceive, reason with, and maintain robustness regarding colors, despite scaling laws and CoT reasoning offering some improvements.

Authors:Ning Li, Jingran Zhang, Justin Cui
Title: ArXivBench: When You Should Avoid Using ChatGPT for Academic Writing
Abstract:
Large language models (LLMs) demonstrate strong capabilities in reasoning and question answering, yet their tendency to generate factually incorrect content remains a critical challenge. This study evaluates proprietary and open-source LLMs on generating relevant research papers with accurate arXiv links. Our evaluation reveals critical academic risks: LLMs frequently generate incorrect arXiv links or references to non-existent papers, fundamentally undermining their ability to properly attribute research contributions to the actual authors. We introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories among them. Our findings show concerning accuracy variations across subjects, with Claude-3.5-Sonnet exhibiting a substantial advantage in generating both relevant and accurate responses. Notably, most LLMs perform significantly better in Artificial Intelligence than other subfields. This benchmark provides a standardized tool for evaluating LLM reliability in scientific contexts, promoting more dependable academic use in research environments. Our code and dataset are available at https://github.com/liningresearch/arXivBench and https://huggingface.co/datasets/arXivBenchLLM/arXivBench.
中文: 本研究推出arXivBench基准测试,揭示大语言模型常生成错误arXiv链接和虚假参考文献,其中Claude-3.5-Sonnet优势明显,且多数模型在人工智能子领域的表现明显优于其他子领域,凸显了其在学术应用中的严重可靠性问题。
English: This study introduces arXivBench, a benchmark revealing that large language models often produce inaccurate arXiv links and references to non-existent papers, with Claude-3.5-Sonnet showing a substantial advantage and most models performing significantly better in Artificial Intelligence than in other computer science subfields, highlighting critical reliability concerns in academic applications.

Authors:Vikranth Udandarao, Noel Abraham Tiju, Muthuraj Vairamuthu, Harsh Mistry, Dhruv Kumar
Title: Roamify: Designing and Evaluating an LLM Based Google Chrome Extension for Personalised Itinerary Planning
Abstract:
In this paper, we present Roamify, an Artificial Intelligence powered travel assistant that aims to ease the process of travel planning. We have tested and used multiple Large Language Models like Llama and T5 to generate personalised itineraries per user preferences. Results from user surveys highlight the preference for AI powered mediums over existing methods to help in travel planning across all user age groups. These results firmly validate the potential need of such a travel assistant. We highlight the two primary design considerations for travel assistance: D1) incorporating a web-scraping method to gather up-to-date news articles about destinations from various blog sources, which significantly improves our itinerary suggestions, and D2) utilising user preferences to create customised travel experiences along with a recommendation system which changes the itinerary according to the user needs. Our findings suggest that Roamify has the potential to improve and simplify how users across multiple age groups plan their travel experiences.
中文: Roamify是一款人工智能驱动的旅行助手,它利用大型语言模型和网络爬取数据,根据用户偏好生成个性化行程,研究显示各年龄段用户均倾向于使用此类AI工具来简化旅行规划。
English: Roamify is an AI-powered travel assistant that uses large language models to create personalized itineraries from user preferences and up-to-date web-scraped articles, with user surveys showing a preference for AI-powered travel planning across all age groups.

Authors:Yasser Benigmim, Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette
Title: FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
Abstract:
In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., a photo of <class>, a sketch of a <class>). We investigate the impact of templates for OVSS, and find that for each class, there exist single-template classifiers (which we refer to as class-experts) that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS .
中文摘要:本文提出FLOSS方法,通过熵分析筛选单模板类别专家分类器并进行融合,无需额外训练或标注即可显著提升开放词汇语义分割性能。
English Summary: This paper introduces FLOSS, a plug-and-play method that identifies single-template class-expert classifiers through entropy analysis and combines them to significantly enhance Open-Vocabulary Semantic Segmentation without requiring additional training or labels.
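
The entropy-based selection described in the abstract can be illustrated with a small sketch. It assumes each single-template classifier yields a per-pixel class distribution on unlabeled images, and that the class-wise entropy of a template is the mean entropy over the pixels that template assigns to the class. That aggregation, along with the names `select_class_experts` and `probs`, is an assumption made for illustration and may differ from the authors' implementation.

```python
import math
from typing import List, Sequence


def entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def select_class_experts(
    probs: Sequence[Sequence[Sequence[float]]], num_classes: int
) -> List[int]:
    """Pick, for every class, the template with the lowest class-wise entropy.

    probs[t][n] is the class distribution that template t predicts for
    pixel n. No labels are used. A class that no template ever predicts
    falls back to template 0.
    """
    experts = []
    for k in range(num_classes):
        best_template, best_score = 0, math.inf
        for t, template_probs in enumerate(probs):
            # Entropies of the pixels that template t assigns to class k.
            ents = [
                entropy(dist)
                for dist in template_probs
                if max(range(num_classes), key=dist.__getitem__) == k
            ]
            if not ents:
                continue
            score = sum(ents) / len(ents)
            if score < best_score:
                best_template, best_score = t, score
        experts.append(best_template)
    return experts


# Toy data: template 0 is confident on class 0, template 1 on class 1.
toy_probs = [
    [[0.99, 0.01], [0.99, 0.01], [0.4, 0.6], [0.4, 0.6]],
    [[0.6, 0.4], [0.6, 0.4], [0.01, 0.99], [0.01, 0.99]],
]
toy_experts = select_class_experts(toy_probs, num_classes=2)
```

The fusion step that combines the outputs of the selected class-experts is not sketched here, since the abstract does not describe how it works.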

Authors:Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
Title: MIEB: Massive Image Embedding Benchmark
Abstract:
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
中文摘要:我们推出了大规模图像嵌入基准(MIEB),通过涵盖38种语言的130项任务全面评估图像及图文嵌入模型,发现没有单一模型能在所有类别中表现卓越,同时揭示了先进视觉模型在文本视觉表征方面的优势及其在混合编码和干扰环境下图文匹配的局限性。
English Summary: The Massive Image Embedding Benchmark (MIEB) is introduced to comprehensively evaluate image and image-text embedding models across 130 tasks in 38 languages, revealing that no single model excels in all categories while uncovering both strengths and limitations in advanced vision models.

Authors:Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
Title: Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Abstract:
Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
Chinese: Pixel-SAIL是一种高度简化的多模态大语言模型,通过三项关键技术改进,在像素级理解任务中取得了可比甚至更优的性能,无需依赖视觉编码器等额外组件。
English: Pixel-SAIL is a highly simplified multimodal large language model that achieves comparable or superior performance in pixel-level understanding tasks through three key technical improvements, eliminating the need for extra components like vision encoders.

Authors:Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
Title: The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Abstract:
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties (including scalability, cross-modal information flow patterns, and visual representation capabilities) with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
中文摘要:SAIL是一种统一的多模态大语言模型,在单一Transformer架构中整合视觉与语言处理,无需预训练视觉编码器即可达到模块化模型的性能水平,同时展现出更强的可扩展性和独特的跨模态交互特性。
English Summary: SAIL is a unified multimodal large language model that integrates vision and language processing within a single transformer architecture, achieving performance comparable to modular models while offering enhanced scalability and distinct cross-modal interaction patterns.

Authors:Davide Piras, Francesco Sorrenti, Ruth Durrer, Martin Kunz
Title: Anchors no more: Using peculiar velocities to constrain $H_0$ and the primordial Universe without calibrators
Abstract:
We develop a novel approach to constrain the Hubble parameter $H_0$ and the primordial power spectrum amplitude $A_\mathrm{s}$ using type Ia supernovae (SNIa) data. By considering SNIa as tracers of the peculiar velocity field, we can model their distance and their covariance as a function of cosmological parameters without the need of calibrators like Cepheids; this yields a new independent probe of the large-scale structure based on SNIa data without distance anchors. Crucially, we implement a differentiable pipeline in JAX, including efficient emulators and affine sampling, reducing inference time from years to hours on a single GPU. We first validate our method on mock datasets, demonstrating that we can constrain $H_0$ and $\log 10^{10}A_\mathrm{s}$ within $10\%$ and $15\%$, respectively, using $\mathcal{O}(10^3)$ SNIa. We then test our pipeline with SNIa from an $N$-body simulation, obtaining $6\%$-level unbiased constraints on $H_0$ with a moderate noise level. We finally apply our method to Pantheon+ data, constraining $H_0$ at the $15\%$ level without Cepheids when fixing $A_\mathrm{s}$ to its $\it{Planck}$ value. On the other hand, we obtain $20\%$-level constraints on $\log 10^{10}A_\mathrm{s}$ in agreement with $\it{Planck}$ when including Cepheids in the analysis. In light of upcoming observations of low redshift SNIa from the Zwicky Transient Facility and the Vera Rubin Legacy Survey of Space and Time, surveys for which our method will develop its full potential, we make our code publicly available.
Chinese: 我们开发了一种利用Ia型超新星作为本动速度示踪物的新方法,无需距离校准器即可独立约束哈勃参数和原初功率谱振幅,通过JAX可微分计算框架将推断时间从数年缩短至数小时。
English: We introduce a novel method using type Ia supernovae as peculiar velocity tracers to independently constrain the Hubble parameter and primordial power spectrum amplitude, achieving efficient inference through a differentiable JAX pipeline that reduces computation time from years to hours.

Authors:Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
Title: M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Abstract:
Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same-size transformer. With this throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
Chinese: M1模型基于Mamba架构的混合线性RNN,通过提高推理效率,在性能上超越了以往的线性RNN模型,与顶尖蒸馏模型相当,同时相比Transformer实现了超过3倍的生成加速。
English: The M1 model, a hybrid linear RNN based on the Mamba architecture, enhances reasoning efficiency by outperforming prior linear RNNs and matching state-of-the-art distilled models while achieving over 3x speedup in generation compared to transformers.
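
Self-consistency voting, mentioned in the abstract, samples several reasoning chains and keeps the most frequent final answer, so higher throughput means more samples within the same time budget. The sketch below shows only the voting step, in its usual majority-vote form. The function name and the whitespace normalization are assumptions for illustration, not details from the paper.

```python
from collections import Counter
from typing import Iterable


def self_consistency_vote(answers: Iterable[str]) -> str:
    """Return the most frequent final answer among sampled generations.

    Answers are compared after stripping surrounding whitespace; empty
    answers are ignored. Ties go to the answer that appeared first.
    """
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        raise ValueError("no non-empty answers to vote on")
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical final answers extracted from four sampled reasoning chains.
voted = self_consistency_vote(["42", "41", " 42 ", "42"])
```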

Authors:Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
Title: RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Abstract:
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
中文: RealWebAssist 是一个新颖的基准测试,旨在评估AI代理处理现实世界中连续、模糊用户指令的能力,当前模型在意图推理和图形界面定位方面仍面临显著挑战。
English: RealWebAssist is a new benchmark designed to evaluate AI agents' ability to handle sequential, ambiguous real-world user instructions for long-horizon web tasks, where current models struggle with intent reasoning and GUI grounding.

Authors:Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Title: Multimodal Long Video Modeling Based on Temporal Dynamic Context
Abstract:
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
中文: 本文提出时序动态上下文(TDC)方法,通过将视频分割为语义场景、使用时序上下文压缩器减少标记数量,并采用思维链策略处理超长视频,在视频与音频理解基准测试中表现优异。
English: This paper introduces Temporal Dynamic Context (TDC), a dynamic long video encoding method that segments videos into scenes, compresses tokens using a temporal context compressor, and employs a chain-of-thought strategy for enhanced video and audio understanding, achieving strong performance on benchmarks.
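The scene segmentation step described in the abstract (splitting a video into semantically consistent scenes from inter-frame similarities) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine-similarity measure, the `threshold` value, and the function names are assumptions, and the toy 2-D vectors stand in for real frame features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_scenes(frame_features, threshold=0.8):
    """Split frames into contiguous scenes.

    A new scene starts whenever the similarity between adjacent frames
    drops below `threshold` (a hypothetical value, not taken from the paper).
    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    if not frame_features:
        return []
    scenes, start = [], 0
    for i in range(1, len(frame_features)):
        if cosine(frame_features[i - 1], frame_features[i]) < threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frame_features)))
    return scenes


# Toy features: two similar frames, then an abrupt change, then two similar frames.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_scenes(frames))
```

Each resulting segment would then be the unit over which tokens are compressed; the compressor itself is not sketched here.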

Authors:Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing
Abstract:
Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM
中文摘要:本文提出隐式结构锁定(ISLock)方法,通过自注意力对齐实现无需训练的自回归图像编辑,在保持结构一致性的同时弥合了与扩散模型的性能差距。
English Summary: This paper introduces Implicit Structure Locking (ISLock), a training-free editing method for autoregressive image models that preserves structural consistency through self-attention alignment, bridging the performance gap with diffusion models.

Authors:Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model
Abstract:
Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
中文: MonoDiff9D是一种基于扩散模型的方法,它通过利用概率扩散模型和DINOv2的粗略深度估计,无需形状先验或CAD模型即可实现最先进的单目类别级9D物体姿态估计。
English: MonoDiff9D is a diffusion-based method that achieves state-of-the-art monocular category-level 9D object pose estimation without requiring shape priors or CAD models by leveraging probabilistic diffusion models and coarse depth estimation from DINOv2.
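The abstract mentions converting the estimated coarse depth into a point cloud. A standard way to do this is pinhole back-projection, sketched below under stated assumptions: the camera intrinsics `fx`, `fy`, `cx`, `cy` are assumed to be known, the depth map is a plain list of rows, and the function name is hypothetical. Whether the paper uses exactly this conversion is not stated in the abstract.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map to 3D camera-frame points.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy,
    where (u, v) is the pixel column and row and Z is the depth at that pixel.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 depth map with one invalid pixel and illustrative intrinsics.
depth = [[2.0, 0.0], [2.0, 4.0]]
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```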

Authors:Yonghui Yang, Le Wu, Yuxin Liao, Zhuangzhuang He, Pengyang Shao, Richang Hong, Meng Wang
Title: Invariance Matters: Empowering Social Recommendation via Graph Invariant Learning
Abstract:
Graph-based social recommendation systems have shown significant promise in enhancing recommendation performance, particularly in addressing the issue of data sparsity in user behaviors. Typically, these systems leverage Graph Neural Networks (GNNs) to capture user preferences by incorporating high-order social influences from observed social networks. However, existing graph-based social recommendations often overlook the fact that social networks are inherently noisy, containing task-irrelevant relationships that can hinder accurate user preference learning. The removal of these redundant social relations is crucial, yet it remains challenging due to the lack of ground truth. In this paper, we approach the social denoising problem from the perspective of graph invariant learning and propose a novel method, Social Graph Invariant Learning (SGIL). Specifically, SGIL aims to uncover stable user preferences within the input social graph, thereby enhancing the robustness of graph-based social recommendation systems. To achieve this goal, SGIL first simulates multiple noisy social environments through graph generators. It then seeks to learn environment-invariant user preferences by minimizing invariant risk across these environments. To further promote diversity in the generated social environments, we employ an adversarial training strategy to simulate more potential noisy social distributions. Extensive experimental results demonstrate the effectiveness of the proposed SGIL. The code is available at https://github.com/yimutianyang/SIGIR2025-SGIL.
中文: 针对图社交推荐系统中社交网络噪声问题,本文提出社交图不变学习方法,通过对抗训练模拟多种噪声环境来学习稳定的用户偏好,从而增强推荐系统的鲁棒性。
English: Because observed social networks are noisy, this paper proposes Social Graph Invariant Learning (SGIL), which simulates multiple noisy social environments with graph generators, diversifies them through adversarial training, and learns environment-invariant user preferences to make graph-based social recommendation more robust.

Authors:Michał Turski, Mateusz Chiliński, Łukasz Borchmann
Title: Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA
Abstract:
Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA
中文: CheckboxQA数据集的推出旨在解决大型视觉与语言模型在复选框识别上的不足,成为提升法律科技和金融等领域文档处理能力的重要工具。
English: The CheckboxQA dataset is introduced to address the limitations of Large Vision and Language Models in interpreting checkboxes, serving as a critical tool for improving document processing in fields like legal tech and finance.

Authors:Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
Abstract:
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
中文摘要:LLM-SRBench是一个新颖的基准测试,通过涵盖四个科学领域的239个挑战性问题来严格评估大语言模型在科学方程发现中的能力,其设计能有效防止记忆效应,结果显示当前最优方法仅达31.5%的准确率,凸显了该领域的研究挑战。
English Summary: LLM-SRBench is a benchmark of 239 challenging problems across four scientific domains, designed to evaluate LLM-based scientific equation discovery while preventing trivial memorization; the best-performing system evaluated reaches only 31.5% symbolic accuracy, underscoring the difficulty of the task.

Authors:Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi
Title: Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol
Abstract:
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.
中文: 本研究通过结合大语言模型方法与人工标注,解决了用户查询模糊和内容不相关等现实难题,提出了ARXIV2TABLE基准测试,实验表明现有模型在此任务上仍有明显不足。
English: This research advances literature review table generation by addressing real-world challenges like vague user queries and irrelevant content through a novel LLM-based approach and introduces the ARXIV2TABLE benchmark, revealing current models' limitations despite improvements.

Authors:Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Title: Probing then Editing Response Personality of Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that simulate consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in simulating personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly simulate personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.
中文: 本研究提出分层探测框架,发现大语言模型主要在中间和上层模拟人格特征,并提出一种有效的扰动方法,可在保持模型通用能力的同时编辑人格表达。
English: This study introduces a layer-wise probing framework revealing that large language models primarily simulate personality traits in middle and upper layers, and proposes an effective perturbation method to edit these traits with minimal impact on general capabilities.
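The abstract describes treating a trained probing hyperplane as a boundary for each personality category and perturbing representations across it at inference time. One simple way to realise that idea is sketched below. It is an assumption-laden illustration rather than the paper's method: it assumes a linear probe `w·h + b` whose positive side is the target personality, moves the hidden state along `w` by the smallest amount that reaches a chosen `margin`, and uses hypothetical names and values throughout.

```python
def perturb_toward(h, w, b, margin=1.0):
    """Shift hidden state `h` onto the target side of the hyperplane w.h + b = 0.

    If the probe score is already at least `margin`, `h` is returned unchanged.
    Otherwise `h` is moved along `w` by the minimal step that makes the
    score equal to `margin` (a hypothetical hyperparameter).
    """
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("probe weight vector must be non-zero")
    score = sum(wi * hi for wi, hi in zip(w, h)) + b
    if score >= margin:
        return list(h)
    step = (margin - score) / norm_sq
    return [hi + step * wi for hi, wi in zip(h, w)]


# Toy example: the original state sits on the wrong side of the boundary.
edited = perturb_toward([1.0, -2.0], w=[0.0, 2.0], b=0.0)
print(edited)
```

In a real model this would be applied to the hidden states of selected layers; which layers to edit and how large the margin should be are choices the abstract does not specify.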

Authors:Frederik Werner, Simon Sagmeister, Mattia Piccinini, Johannes Betz
Title: A Quasi-Steady-State Black Box Simulation Approach for the Generation of g-g-g-v Diagrams
Abstract:
The classical g-g diagram, representing the achievable acceleration space for a vehicle, is commonly used as a constraint in trajectory planning and control due to its computational simplicity. To address non-planar road geometries, this concept can be extended to incorporate g-g constraints as a function of vehicle speed and vertical acceleration, commonly referred to as g-g-g-v diagrams. However, the estimation of g-g-g-v diagrams is an open problem. Existing simulation-based approaches struggle to isolate non-transient, open-loop stable states across all combinations of speed and acceleration, while optimization-based methods often require simplified vehicle equations and have potential convergence issues. In this paper, we present a novel, open-source, quasi-steady-state black box simulation approach that applies a virtual inertial force in the longitudinal direction. The method emulates the load conditions associated with a specified longitudinal acceleration while maintaining constant vehicle speed, enabling open-loop steering ramps in a purely QSS manner. Appropriate regulation of the ramp steer rate inherently mitigates transient vehicle dynamics when determining the maximum feasible lateral acceleration. Moreover, treating the vehicle model as a black box eliminates model mismatch issues, allowing the use of high-fidelity or proprietary vehicle dynamics models typically unsuited for optimization approaches. An open-source version of the proposed method is available at: https://github.com/TUM-AVS/GGGVDiagrams
中文: 经典g-g图被扩展为适用于非平面道路的g-g-g-v图,但其估计仍存在挑战,因为难以分离稳定状态且模型受限,因此提出了一种新型开源准稳态模拟方法,通过虚拟惯性力精确计算最大横向加速度,并将车辆视为黑箱以获取高保真结果。
English: The classical g-g diagram is extended to g-g-g-v diagrams for non-planar roads, but their estimation remains challenging due to difficulties in isolating stable states and model limitations, leading to a novel open-source quasi-steady-state simulation that uses a virtual inertial force to accurately determine maximum lateral acceleration while treating the vehicle as a black box for high-fidelity results.

Authors:Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Title: Efficient Generative Model Training via Embedded Representation Warmup
Abstract:
Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region -- primarily in the early layers -- where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework in which, in the first stage, the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW's efficacy depends on its precise integration into specific neural network layers -- termed the representation processing region -- where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40× acceleration in training speed compared to REPA, the current state-of-the-art method. Code is available at https://github.com/LINs-lab/ERW.
Chinese: 扩散模型在训练效率和表征质量上不及自监督方法,因此我们提出嵌入式表征预热(ERW)这一即插即用框架,用高质量预训练表征初始化扩散模型的早期层,从而加速收敛并提升性能,训练速度相比REPA提升40倍。
English: Diffusion models lag behind self-supervised methods in training efficiency and representation quality, so the authors propose Embedded Representation Warmup (ERW), a plug-and-play framework that initializes the diffusion model's early layers with high-quality pretrained representations, achieving a 40× training speedup over REPA.

Authors:Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Title: Efficient Generative Model Training via Embedded Representation Warmup
Abstract:
Generative models face a fundamental challenge: they must simultaneously learn high-level semantic concepts (what to generate) and low-level synthesis details (how to generate it). Conventional end-to-end training entangles these distinct, and often conflicting, objectives, leading to a complex and inefficient optimization process. We argue that explicitly decoupling these tasks is key to unlocking more effective and efficient generative modeling. To this end, we propose Embedded Representation Warmup (ERW), a principled two-phase training framework. The first phase is dedicated to building a robust semantic foundation by aligning the early layers of a diffusion model with a powerful pretrained encoder. This provides a strong representational prior, allowing the second phase -- generative full training with alignment loss to refine the representation -- to focus its resources on high-fidelity synthesis. Our analysis confirms that this efficacy stems from functionally specializing the model's early layers for representation. Empirically, our framework achieves an 11.5× speedup in 350 epochs to reach FID=1.41 compared to single-phase methods like REPA. Code is available at https://github.com/LINs-lab/ERW.
Chinese: 生成模型面临高层次语义与低层次合成细节的平衡难题,因此我们提出嵌入式表征预热(ERW)框架,通过先构建语义基础再优化合成的两阶段训练,实现了11.5倍加速和更优性能。
English: Generative models struggle with balancing high-level semantics and low-level synthesis details, so we propose Embedded Representation Warmup (ERW), a two-phase training framework that first builds a semantic foundation and then refines synthesis, achieving an 11.5× speedup and superior performance.

Authors:Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Title: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
Abstract:
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Code is available at https://github.com/OPTML-Group/MU-Coreset.
中文: 大语言模型遗忘可以通过仅使用遗忘集中极小部分(如5%)作为核心集有效实现,这归因于高影响力关键词而非整个数据集的作用,且该效应在不同方法和数据选择策略中均表现稳健。
English: On current benchmarks, LLM unlearning achieved with the full forget set can be maintained using a small subset (a coreset) of as little as 5% of the forget set, even when selected at random; the effect is robust across unlearning methods and data selection strategies and is attributed to a compact set of high-impact keywords rather than the entire dataset.
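The random-selection variant of the coreset described in the abstract amounts to drawing a small uniform sample from the forget set. The sketch below shows only that sampling step; the function name, the fixed seed, and the rounding rule are assumptions for illustration, and the unlearning procedure that would consume the sample (for example NPO or RMU) is not shown.

```python
import random


def sample_coreset(forget_set, fraction=0.05, seed=0):
    """Draw a uniform random subset of the forget set without replacement.

    `fraction` defaults to 5%, the smallest size quoted in the abstract.
    At least one example is kept for any non-empty forget set.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not forget_set:
        return []
    k = max(1, round(len(forget_set) * fraction))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(forget_set, k)


# Toy forget set of 200 document ids; a 5% coreset keeps 10 of them.
forget_ids = list(range(200))
coreset = sample_coreset(forget_ids)
print(len(coreset))
```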

Authors:Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu
Title: MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning
Abstract:
Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.
中文摘要:MT-R1-Zero框架通过规则与指标混合奖励机制,成功将强化学习应用于机器翻译领域,在实现与先进模型相媲美性能的同时,展现出在多语言和低资源场景下的强大泛化能力。
English Summary: The MT-R1-Zero framework successfully adapts reinforcement learning to machine translation by using a rule-metric mixed reward mechanism, achieving competitive performance against advanced models while demonstrating strong generalization in multilingual and low-resource settings.
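A rule-metric mixed reward, as named in the abstract, combines a rule-based check with a translation-quality metric. The sketch below illustrates one plausible shape for such a reward and should not be read as the paper's exact formulation: the `<think>`/`<translate>` tag format, the `format_penalty` value, and the unigram-overlap F1 (used here as a stand-in for a real MT metric) are all assumptions.

```python
import re
from collections import Counter

# Assumed output format: reasoning in <think>, final translation in <translate>.
PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<translate>(.*?)</translate>$", re.DOTALL
)


def toy_metric(hypothesis, reference):
    """Unigram-overlap F1 in [0, 1]; a stand-in for a real MT metric."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mixed_reward(output, reference, format_penalty=-1.0):
    """Rule part: penalise outputs that break the expected format.
    Metric part: otherwise score the extracted translation."""
    match = PATTERN.match(output.strip())
    if match is None:
        return format_penalty
    return toy_metric(match.group(2).strip(), reference)


good = "<think>Literal rendering works.</think><translate>hello world</translate>"
print(mixed_reward(good, "hello world"))
print(mixed_reward("hello world", "hello world"))
```

Gating the metric behind the format rule means a well-scoring translation earns nothing if the output structure is wrong, which is one common way to keep the two signals from conflicting.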

Authors:Xiaopeng Li, Pengyue Jia, Derong Xu, Yi Wen, Yingyi Zhang, Wenlin Zhang, Wanyu Wang, Yichao Wang, Zhaocheng Du, Xiangyang Li, Yong Liu, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Title: A Survey of Personalization: From RAG to Agent
Abstract:
Personalization has become an essential capability in modern AI systems, enabling customized interactions that align with individual user preferences, contexts, and goals. Recent research has increasingly concentrated on Retrieval-Augmented Generation (RAG) frameworks and their evolution into more advanced agent-based architectures within personalized settings to enhance user satisfaction. Building on this foundation, this survey systematically examines personalization across the three core stages of RAG: pre-retrieval, retrieval, and generation. Beyond RAG, we further extend its capabilities into the realm of Personalized LLM-based Agents, which enhance traditional RAG systems with agentic functionalities, including user understanding, personalized planning and execution, and dynamic generation. For both personalization in RAG and agent-based personalization, we provide formal definitions, conduct a comprehensive review of recent literature, and summarize key datasets and evaluation metrics. Additionally, we discuss fundamental challenges, limitations, and promising research directions in this evolving field. Relevant papers and resources are continuously updated at https://github.com/Applied-Machine-Learning-Lab/Awesome-Personalized-RAG-Agent.
中文摘要:本综述系统探讨了检索增强生成框架中的个性化技术及其向个性化大语言模型智能体的演进,全面回顾了相关方法、数据集与挑战,并指出了未来研究方向。
English Summary: This survey systematically explores personalization in Retrieval-Augmented Generation (RAG) frameworks and their evolution into personalized LLM-based agents, reviewing methodologies, datasets, and challenges while identifying future research directions.

Authors:Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
Title: Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
Abstract:
Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.
中文: GUI代理面临高质量数据稀缺的挑战,通过在中期训练阶段让视觉语言模型学习多样化推理任务,可显著提升其在图形界面规划场景中的泛化能力和性能表现。
English: GUI agents face data scarcity issues, but training Vision Language Models on diverse reasoning tasks during mid-training significantly enhances their generalization and performance across GUI planning scenarios.

Authors:Yating Liu, Yaowei Li, Xiangyuan Lan, Wenming Yang, Zimo Liu, Qingmin Liao
Title: UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
Abstract:
Text-based Person Retrieval (TPR), a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which has shown notable performance improvements compared to uni-modal pre-training models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. First, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
中文: 本文提出UP-Person方法,通过整合Prefix、LoRA和Adapter三个轻量化组件,在仅微调4.7%参数的情况下有效迁移CLIP的多模态知识,在多个行人检索数据集上实现了最优性能。
English: This paper introduces UP-Person, a parameter-efficient transfer learning method that enhances text-based person retrieval by integrating Prefix, LoRA, and Adapter components to optimize CLIP's multi-modal knowledge with minimal parameter tuning, achieving state-of-the-art results on multiple datasets.
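As a rough illustration of the LoRA component named in the abstract, the sketch below adds a trainable low-rank update to a frozen weight matrix. It follows the generic LoRA formulation (output = W x + (alpha / r) B A x), not the UP-Person code; the class name `LoRALinear`, the helper `matvec`, and all default values are assumptions made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, never updated
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_fraction(self):
        """Share of parameters that would be trained (A and B only)."""
        frozen = len(self.weight) * len(self.weight[0])
        trainable = len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])
        return trainable / (frozen + trainable)
```

For realistic layer sizes the trainable fraction is far smaller than in this toy setting, which is the property that makes a figure such as 4.7\% of parameters plausible for a combined Prefix, LoRA and Adapter design.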

Authors:Wanyun Zhou, Saizhuo Wang, Xiang Li, Yiyan Qi, Jian Guo, Xiaowen Chu
Title: Unleashing Expert Opinion from Social Media for Stock Prediction
Abstract:
While the stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces the common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics, including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies. The code is available at https://github.com/wanyunzh/DualGAT.
中文: 本研究提出动态专家追踪算法和双图注意力网络,通过过滤社交媒体噪声数据、识别有效交易专家并跨股票传播其信号,显著提升了股票趋势预测和收益率预测的准确性,同时结合传统金融特征构建了更稳健的投资策略。
English: This study introduces a dynamic expert tracing algorithm and a dual graph attention network to filter noisy social media data, identify valuable trading experts, and propagate their signals across stocks, significantly improving stock trend prediction and return ratio forecasts while integrating with traditional financial features for robust investment strategies.
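The idea of true and inverse experts can be sketched as follows. This is a simplified, static stand-in for the dynamic tracing algorithm described in the abstract: the function names, the `min_posts` cutoff, and the 0.8 accuracy threshold are assumptions chosen for illustration, not values from the paper.

```python
def classify_users(history, min_posts=5, threshold=0.8):
    """Label users by how consistently their past calls matched the market.

    history maps user -> list of (predicted_direction, realized_direction),
    where each direction is +1 (up) or -1 (down).
    """
    labels = {}
    for user, records in history.items():
        if len(records) < min_posts:
            continue  # too few posts to judge consistency
        hits = sum(1 for predicted, realized in records if predicted == realized)
        accuracy = hits / len(records)
        if accuracy >= threshold:
            labels[user] = "true_expert"
        elif accuracy <= 1.0 - threshold:
            labels[user] = "inverse_expert"  # consistently wrong, so still informative
    return labels


def expert_signal(labels, todays_posts):
    """Combine today's expert posts into +1, -1, 0, or None (no expert posted)."""
    votes = []
    for user, predicted in todays_posts.items():
        label = labels.get(user)
        if label == "true_expert":
            votes.append(predicted)
        elif label == "inverse_expert":
            votes.append(-predicted)  # flip the call of an inverse expert
    if not votes:
        return None  # signal sparsity: most stock-days have no expert coverage
    total = sum(votes)
    return (total > 0) - (total < 0)
```

The `None` branch corresponds to the sparsity problem the abstract raises: on most stock-day combinations no expert has posted, which motivates propagating signals across related stocks.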

Authors:Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, Zuxuan Wu
Title: Aligning Anime Video Generation with Human Feedback
Abstract:
Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporate human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experimental results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our code and dataset are publicly available at https://github.com/bilibili/Index-anisora.
中文摘要:本研究提出了一种通过构建首个动漫视频多维度奖励数据集并开发AnimeReward模型及间隙感知偏好优化方法,来提升动漫视频生成质量、使其更符合人类偏好的创新流程。
English Summary: This study introduces a pipeline that improves anime video generation by creating the first multi-dimensional reward dataset with human feedback and developing the AnimeReward model alongside Gap-Aware Preference Optimization to better align outputs with human preferences.

Authors:Hao Ren, Yiming Zeng, Zetong Bi, Zhaoliang Wan, Junlong Huang, Hui Cheng
Title: Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
Abstract:
Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by starting the denoising process from Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose NaviBridger, a novel, unified visual navigation framework that leverages denoising diffusion bridge models. This approach enables action generation starting from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at https://github.com/hren20/NaiviBridger.
中文摘要:提出的 NaviBridger 框架通过利用扩散桥模型从信息性先验动作生成动作,改进了视觉导航性能,减少了去噪步骤,在仿真和真实环境中均优于基线方法。
English Summary: The proposed NaviBridger framework improves visual navigation by using diffusion bridge models to generate actions from informative prior actions, reducing denoising steps and outperforming baseline methods in both simulated and real-world environments.

Authors:Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li
Title: RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
Abstract:
Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR
中文: 本文提出了一种新颖的RGB-事件多模态行人属性识别任务和数据集(EventPAR),通过结合事件相机弥补RGB相机的局限性,在复杂环境下提升性能,并开发了基于RWKV的新框架取得领先成果。
English: This paper introduces a novel multi-modal RGB-Event pedestrian attribute recognition task and dataset (EventPAR), addressing limitations of RGB cameras by incorporating event cameras for improved performance in challenging conditions, along with a new RWKV-based framework achieving state-of-the-art results.

Authors:Yiming Zeng, Hao Ren, Shuhang Wang, Junlong Huang, Hui Cheng
Title: NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation
Abstract:
Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, which offers adaptability to new scenarios, but they are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate the zero-shot transfer capability of our approach, which achieves higher success rates and fewer collisions than baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.
中文摘要:本文提出一种混合视觉导航方法,通过将条件扩散模型与可微分成本梯度相结合,无需重新训练即可生成有效路径,在多种环境中展现出优于基线方法的零样本迁移性能。
English Summary: This paper introduces a hybrid visual navigation approach that combines a conditional diffusion model with differentiable cost gradients to generate valid paths without retraining, demonstrating superior zero-shot performance in diverse environments compared to baseline methods.
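The cost-guidance idea can be illustrated with a toy 2D example in which every denoising step is followed by a gradient step on a differentiable obstacle cost. Everything here is a stand-in under stated assumptions: `denoise_step` replaces the learned conditional diffusion model with a simple pull toward a straight line, the quadratic obstacle penalty and the `scale` value are hypothetical, and none of the names come from the NaviDiffusor code.

```python
import math
import random


def straight_line(start, goal, n):
    """n evenly spaced waypoints from start to goal."""
    return [(start[0] + (goal[0] - start[0]) * i / (n - 1),
             start[1] + (goal[1] - start[1]) * i / (n - 1)) for i in range(n)]


def denoise_step(path, target, rate=0.5):
    """Stand-in for a learned denoiser: move each waypoint toward a target path."""
    return [(x + rate * (tx - x), y + rate * (ty - y))
            for (x, y), (tx, ty) in zip(path, target)]


def obstacle_cost_grad(point, center, radius):
    """Gradient of max(0, radius - d)**2, where d is the distance to the center."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    d = math.hypot(dx, dy)
    if d >= radius or d == 0.0:
        return (0.0, 0.0)  # outside the obstacle (or undefined direction)
    coeff = -2.0 * (radius - d) / d
    return (coeff * dx, coeff * dy)


def sample_path(start, goal, center, radius, guide, n=11, steps=20, scale=0.5, seed=0):
    """Run the toy denoising loop, optionally descending the obstacle cost."""
    rng = random.Random(seed)
    target = straight_line(start, goal, n)
    path = [(rng.gauss(0.5, 0.3), rng.gauss(0.0, 0.3)) for _ in range(n)]
    for _ in range(steps):
        path = denoise_step(path, target)
        if guide:
            grads = [obstacle_cost_grad(p, center, radius) for p in path]
            path = [(x - scale * gx, y - scale * gy)
                    for (x, y), (gx, gy) in zip(path, grads)]
    return path


def clearance(path, center):
    """Smallest distance from any waypoint to the obstacle center."""
    return min(math.hypot(x - center[0], y - center[1]) for x, y in path)
```

Because the cost enters only at sampling time, the obstacle (or any other differentiable constraint) can be changed without retraining the denoiser, which is the plug-and-play property the abstract emphasizes.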

Authors:Si-Tong Wei, Rui-Huan Wang, Chuan-Zhi Zhou, Baoquan Chen, Peng-Shuai Wang
Title: OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation
Abstract:
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., $1024^3$, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.
中文: OctGPT是一种新颖的多尺度自回归模型,通过序列化八叉树表示和优化的变换器技术,显著提升了3D形状生成的效率和性能,可与扩散模型相媲美。
English: OctGPT is a novel multiscale autoregressive model that significantly enhances efficiency and performance in 3D shape generation, rivaling diffusion models through innovations like serialized octree representation and optimized transformers.
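To make the notion of a serialized octree concrete, the sketch below turns a set of occupied voxels into per-level binary sequences, one bit per child octant, in breadth-first order. It is only an assumed, minimal encoding: the actual ordering used by OctGPT and its VQVAE detail tokens are not reproduced, and the function names are hypothetical.

```python
def _child(node, octant):
    """Coordinates of one of the eight children of a node."""
    nx, ny, nz = node
    return (2 * nx + (octant & 1),
            2 * ny + ((octant >> 1) & 1),
            2 * nz + ((octant >> 2) & 1))


def serialize_octree(voxels, depth):
    """Encode occupied voxels of a (2**depth)**3 grid as per-level bit lists.

    Each non-empty node emits 8 bits telling which child octants are
    occupied; empty octants are never subdivided.
    """
    levels = []
    nodes = [(0, 0, 0)]  # the root node covers the whole grid
    for level in range(depth):
        shift = depth - level - 1  # children at this level span 2**shift voxels
        occupied = {(x >> shift, y >> shift, z >> shift) for x, y, z in voxels}
        bits, next_nodes = [], []
        for node in nodes:
            for octant in range(8):
                child = _child(node, octant)
                filled = child in occupied
                bits.append(1 if filled else 0)
                if filled:
                    next_nodes.append(child)
        levels.append(bits)
        nodes = next_nodes
    return levels


def deserialize_octree(levels):
    """Invert serialize_octree and recover the set of occupied voxels."""
    nodes = [(0, 0, 0)]
    for bits in levels:
        next_nodes = []
        for i, node in enumerate(nodes):
            for octant in range(8):
                if bits[8 * i + octant]:
                    next_nodes.append(_child(node, octant))
        nodes = next_nodes
    return set(nodes)
```

For a sparse shape the total number of emitted bits grows with the number of occupied nodes rather than with the full grid volume, which is why a multiscale binary sequence can stay compact at high resolution.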

Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Title: Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
Abstract:
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a \emph{Sparse Prompt Module (SPM)} that efficiently captures degradation-specific features while minimizing redundancy, and a \emph{Contrastive Prompt Regularization (CPR)} that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
中文: 本文提出对比提示学习(CPL)框架,通过稀疏提示模块减少冗余和对比提示正则化强化任务边界,有效提升全场景图像修复中的提示与任务对齐能力,在多项基准测试中实现了最先进的性能。
English: This paper introduces Contrastive Prompt Learning (CPL), a novel framework that enhances prompt-task alignment in all-in-one image restoration through a Sparse Prompt Module to reduce redundancy and Contrastive Prompt Regularization to strengthen task boundaries, achieving state-of-the-art performance across multiple benchmarks.
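One common way to use negative samples to sharpen task boundaries is an InfoNCE-style loss, sketched below for a single degradation feature and a bank of per-task prompts. The abstract does not state the exact form of CPR, so this loss, the cosine similarity, the temperature of 0.1, and the function names are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_prompt_loss(feature, prompts, positive_index, temperature=0.1):
    """InfoNCE-style loss: pull the feature toward its own task prompt and
    push it away from the prompts of other degradation types (negatives)."""
    logits = [cosine(feature, p) / temperature for p in prompts]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]
```

The loss is small when the feature aligns with its own prompt and large when it aligns with a prompt of another degradation type, which penalizes overlapping task representations.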

Authors:Dongliang Luo, Hanshen Zhu, Ziyang Zhang, Dingkang Liang, Xudong Xie, Yuliang Liu, Xiang Bai
Title: SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
Abstract:
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.
中文: 本文提出SemiETS半监督端到端文本检测框架,通过生成可靠的分层伪标签和利用双向信息流解决伪标签不一致及监督次优问题,在多种数据集和设置下均实现卓越性能。
English: This paper introduces SemiETS, a semi-supervised framework for end-to-end text spotting that addresses challenges like inconsistent pseudo labels and sub-optimal supervision by generating reliable hierarchical pseudo labels and leveraging bidirectional information flow, achieving superior performance across various datasets and settings.

Authors:Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
Title: FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Abstract:
We introduce FUSION, a family of multimodal large language models (MLLMs) with a full vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales (3B and 8B) and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION
中文: FUSION是一种通过文本引导编码和递归对齐实现深度视觉语言融合的多模态大语言模型,在多个基准测试中以更少的视觉标记超越了更大规模的模型。
English: FUSION is a multimodal large language model that achieves deep vision-language integration through text-guided encoding and recursive alignment, outperforming larger models with fewer vision tokens across multiple benchmarks.

Authors:Maria Tzelepi, Vasileios Mezaris
Title: Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Abstract:
Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless; however, there are also memes that promote hate speech and are thereby harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need to understand the complex interactions between images and their embedded text makes hateful meme detection a challenging multimodal task. In this paper, we propose to address this task by leveraging knowledge encoded in powerful Large Multimodal Models (LMMs). Specifically, we propose to exploit LMMs in a two-fold manner. First, we extract knowledge oriented to the hateful meme detection task in order to build strong meme representations. In particular, generic semantic descriptions and the emotions elicited by the images along with their embedded texts are extracted and then used to train a simple classification head for hateful meme detection. Second, we develop a novel hard mining approach that introduces LMM-encoded knowledge directly into the training process, providing further improvements. We perform extensive experiments on two datasets that validate the effectiveness of the proposed method, achieving state-of-the-art performance. Our code and trained models are publicly available at: https://github.com/IDT-ITI/LMM-CLIP-meme.
中文: 本文提出了一种新颖的仇恨表情包检测方法,通过利用大型多模态模型提取任务导向知识并结合硬挖掘策略,经广泛实验验证实现了最优性能。
English: This paper presents a novel method for detecting hateful memes by leveraging Large Multimodal Models (LMMs) to extract task-oriented knowledge and employing a hard mining approach, achieving state-of-the-art performance through extensive experiments.

Authors:Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du, Zhiwei Xu, Longxiang Gao, Li Guo, Shibiao Xu
Title: Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) aims to predict the location of a query image by referencing a database of geotagged images. In the VPR task, a few discriminative local regions in an image often have the most important effect, while mundane background regions do not contribute or even cause perceptual aliasing because they overlap easily. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to simultaneously improve the performance of image retrieval and re-ranking in VPR by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondence ground truth for the VPR task. Third, we propose an efficient and precise re-ranking pipeline based on discriminative region guidance. Finally, experimental results show that FoL achieves state-of-the-art results on multiple VPR benchmarks in both the image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at https://github.com/chenshunpeng/FoL
中文摘要:Focus on Local (FoL) 方法通过挖掘图像中的判别性局部区域并引入伪相关监督,在视觉位置识别的图像检索和重排序阶段均实现了最优性能,同时显著提升了计算效率。
English Summary: The Focus on Local (FoL) approach enhances Visual Place Recognition by mining discriminative local regions and introducing pseudo-correlation supervision, achieving state-of-the-art performance in both image retrieval and re-ranking while improving computational efficiency.

Authors:Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang
Title: SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis
Abstract:
Speech synthesis technology has brought great convenience, while the widespread usage of realistic deepfake audio has triggered hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation (\textit{e.g.}, telecom fraud). However, the existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, \textit{\textbf{SafeSpeech}}, which protects the users' audio before uploading by embedding imperceptible perturbations in the original speech to prevent high-quality speech synthesis. In SafeSpeech, we devise a robust and universal proactive protection technique, \textbf{S}peech \textbf{PE}rturbative \textbf{C}oncealment (\textbf{SPEC}), that leverages a surrogate model to generate universally applicable perturbations for generative synthetic models. Moreover, we optimize the human perception of the embedded perturbation in both the time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, both subjectively and objectively. Our experimental results demonstrate that SafeSpeech achieves state-of-the-art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech has real-time capability in real-world tests. The source code is available at https://github.com/wxzyd123/SafeSpeech.
中文:SafeSpeech是一种防御框架,通过在原始语音中嵌入难以察觉的扰动来防止高质量合成伪造音频,提供针对未经授权语音克隆和滥用的强大实时保护。
English: SafeSpeech is a defensive framework that embeds imperceptible perturbations into original speech to prevent high-quality synthetic deepfake audio, offering robust, real-time protection against unauthorized voice cloning and exploitation.

Authors:Korel Gundem, Zhengling Qi
Title: Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand
Abstract:
In this paper, we study the offline sequential feature-based pricing and inventory control problem where the current demand depends on the past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage the offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamics without censoring can be modeled by a Markov decision process (MDP), the primary obstacle arises from the observed process where demand censoring is present, resulting in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving a specialized Bellman equation tailored for this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms to solve these Bellman equations and, thus, estimate the optimal policy. Furthermore, we establish finite sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at https://github.com/gundemkorel/Inventory_Pricing_Control
中文: 本文提出了首个数据驱动方法,用于在具有截断和依赖需求的序列决策环境中优化定价与库存策略,通过新算法结合理论保证和实证验证实现了策略学习。
English: This paper introduces the first data-driven approach for optimizing pricing and inventory policies in sequential decision-making with censored and dependent demand, proposing novel algorithms with theoretical guarantees and empirical validation.
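The censoring mechanism and the consecutive-censoring counter that characterizes the high-order MDP can be sketched as below. Only the bookkeeping is shown: how the counter enters the Bellman equation is not specified in the abstract, and the function name `censor_sales` and the record layout are assumptions for this example.

```python
def censor_sales(demands, inventories):
    """Turn true demand into what an offline dataset would actually record.

    Returns one (sales, censored, streak) tuple per period, where
    sales    = min(demand, inventory), since excess demand is lost,
    censored = True on a stock-out (sales equals inventory), in which case
               the true demand is only known to be at least the inventory,
    streak   = number of consecutive censored periods ending at this one.
    """
    records, streak = [], 0
    for demand, stock in zip(demands, inventories):
        sales = min(demand, stock)
        censored = sales == stock
        streak = streak + 1 if censored else 0
        records.append((sales, censored, streak))
    return records
```

A longer streak means the most recent uncensored demand observation lies further in the past, which is one intuition for why a policy based only on the latest observation stops being Markov when demand depends on its own history.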

Authors:Hairong Zhang, Jiaheng Si, Guohang Yan, Boyuan Qi, Pinlong Cai, Song Mao, Ding Wang, Botian Shi
Title: RAKG: Document-level Retrieval Augmented Knowledge Graph Construction
Abstract:
With the rise of knowledge-graph-based retrieval-augmented generation (RAG) techniques such as GraphRAG and Pike-RAG, the role of knowledge graphs in enhancing the reasoning capabilities of large language models (LLMs) has become increasingly prominent. However, traditional Knowledge Graph Construction (KGC) methods face challenges like complex entity disambiguation, rigid schema definition, and insufficient cross-document knowledge integration. This paper focuses on the task of automatic document-level knowledge graph construction. It proposes the Document-level Retrieval Augmented Knowledge Graph Construction (RAKG) framework. RAKG extracts pre-entities from text chunks and utilizes these pre-entities as queries for RAG, effectively addressing the issue of long-context forgetting in LLMs and reducing the complexity of coreference resolution. In contrast to conventional KGC methods, RAKG more effectively captures global information and the interconnections among disparate nodes, thereby enhancing the overall performance of the model. Additionally, we transfer the RAG evaluation framework to the KGC field to filter and evaluate the generated knowledge graphs, thereby avoiding incorrectly generated entities and relationships caused by hallucinations in LLMs. We further developed the MINE dataset by constructing standard knowledge graphs for each article and experimentally validated the performance of RAKG. The results show that RAKG achieves an accuracy of 95.91% on the MINE dataset, a 6.2 percentage-point improvement over the current best baseline, GraphRAG (89.71%). The code is available at https://github.com/LMMApplication/RAKG.
中文: 本文提出文档级检索增强知识图谱构建(RAKG)框架,通过解决大语言模型的长上下文遗忘问题并提升全局信息捕捉能力,在MINE数据集上实现了95.91%的准确率。
English: This paper introduces the Document-level Retrieval Augmented Knowledge Graph Construction (RAKG) framework, which enhances knowledge graph construction by addressing long-context forgetting in LLMs and improving global information capture, achieving a 95.91% accuracy on the MINE dataset.

Authors:Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che
Title: Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning
Abstract:
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks, including general understanding, mathematical reasoning, and coding, our system significantly outperforms strong baselines. For instance, M1-32B achieves a 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning. Code is available at https://github.com/jincan333/MAS-TTS
中文摘要:本文提出了一种自适应多智能体框架,通过模型级训练和系统级协调增强协作推理能力,在多项任务中显著超越基线模型,证明了学习型协作与自适应协调对扩展多智能体推理的重要性。
English Summary: This paper introduces an adaptive multi-agent framework that enhances collaborative reasoning through model-level training and system-level coordination, achieving significant performance improvements across various tasks by combining learned collaboration with adaptive coordination.

Authors:Kang Yang, Guanhong Tao, Xun Chen, Jun Xu
Title: Alleviating the Fear of Losing Alignment in LLM Fine-tuning
Abstract:
Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called \textit{alignment} can help. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the \textit{aligned direction} and the \textit{harmful direction}. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (percentage of answering harmful questions) from 33.25\% to 1.74\%, without sacrificing task performance much. In contrast, the existing methods either only reduce the harmful rate to a limited extent or significantly impact the normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
Chinese: 本文提出一种方法,通过在微调后恢复大语言模型的有害方向,采用部分权重还原和回滚机制,将有害回答率从33.25%降至1.74%,同时基本保持任务性能。
English: This paper introduces a method to restore the alignment of large language models compromised during fine-tuning by recovering their harmful direction through partial weight restoration and a rollback mechanism, effectively reducing harmful responses from 33.25% to 1.74% without significant performance loss.

Authors:Zachary J. Wegert, Jordi Manyer, Connor Mallon, Santiago Badia, Vivien J. Challis
Title: Level-set topology optimisation with unfitted finite elements and automatic shape differentiation
Abstract:
In this paper we develop automatic shape differentiation techniques for unfitted discretisations and link these to recent advances in shape calculus for unfitted methods. We extend existing analytic shape calculus results to the case where the domain boundary intersects with the boundary of the background domain. We further show that we can recover these analytic derivatives to machine precision regardless of the mesh size using the developed automatic shape differentiation techniques, drastically reducing the burden associated with the analytic derivation of these quantities. In addition, we show that we can also recover the symmetric shape Hessian. We implement these techniques for both serial and distributed computing frameworks in the Julia package GridapTopOpt and the wider Gridap ecosystem. As part of this implementation we propose a novel graph-based approach for isolated volume detection. We demonstrate the applicability of the unfitted automatic shape differentiation framework and our implementation by considering the three-dimensional minimum compliance topology optimisation of a linear elastic wheel and of a linear elastic structure in a fluid-structure interaction problem with Stokes flow. The implementation is general and allows GridapTopOpt to solve a wider range of problems on unstructured meshes without analytic calculation of shape derivatives and avoiding issues that arise when material properties are smoothed at the domain boundary. The software is open source and available at https://github.com/zjwegert/GridapTopOpt.jl.
中文摘要:本文开发了非拟合离散化的自动形状微分技术,将形状微积分扩展到域边界与背景域相交的情况,并通过Julia软件包GridapTopOpt实现了无需解析推导的精确导数计算。
English Summary: This paper introduces automatic shape differentiation techniques for unfitted discretizations, extending shape calculus to cases where domain boundaries intersect with background domains and enabling machine-precise derivative recovery without analytical derivation.

Authors:Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, James Zou
Title: Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025
Abstract:
Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.
中文: 该评审反馈代理系统利用大语言模型为同行评审提供自动反馈,通过在ICLR 2025的大规模实验证明,能显著提升评审质量、增加评审长度并促进审稿人参与度。
English: The Review Feedback Agent uses large language models to provide automated feedback on peer reviews, significantly improving review quality, length, and reviewer engagement as demonstrated in a large-scale ICLR 2025 study.

Authors:Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy
Title: A Survey on Efficient Vision-Language Models
Abstract:
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
中文摘要:本综述探讨了在边缘设备上优化视觉语言模型效率的关键技术,通过精简架构和权衡性能与内存使用来应对计算挑战,并建立了持续更新的GitHub仓库以推动该领域深入研究。
English Summary: This survey examines techniques for optimizing vision-language models to enhance efficiency on edge devices, addressing computational challenges through compact architectures and performance-memory trade-offs while maintaining an updated GitHub repository to support ongoing research.

Authors:Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao
Title: DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
Abstract:
Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions that vary in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
中文: 本文提出了一种课程学习框架,通过策略优势和置信上界原则自适应地调度不同数据分布的强化学习后训练,从而显著提升大语言模型的收敛速度与最终性能。
English: This paper introduces a curriculum learning framework that adaptively schedules training across diverse data distributions in reinforcement learning-based post-training of large language models, using policy advantages and the Upper Confidence Bound principle to enhance convergence speed and performance.
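Illustrative sketch (not the authors' code): the scheduling idea in the abstract, scoring each distribution by its mean absolute advantage plus a UCB exploration bonus, might look roughly like the following. The class name, the `exploration` weight, the standard `sqrt(2 ln N / n)` bonus, and the softmax that turns scores into sampling probabilities are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


class UCBDistributionSampler:
    """Toy UCB scheduler over named data distributions (hypothetical helper)."""

    def __init__(self, names, exploration=1.0):
        self.names = list(names)
        self.exploration = exploration  # assumed weight on the exploration bonus
        self.counts = {n: 0 for n in self.names}
        self.mean_abs_adv = {n: 0.0 for n in self.names}

    def update(self, name, advantages):
        """Fold per-sample advantages into the running mean of |advantage|."""
        for a in advantages:
            self.counts[name] += 1
            c = self.counts[name]
            self.mean_abs_adv[name] += (abs(a) - self.mean_abs_adv[name]) / c

    def scores(self):
        """UCB score: mean |advantage| (exploitation) + count-based bonus (exploration)."""
        total = sum(self.counts.values())
        out = {}
        for n in self.names:
            if self.counts[n] == 0:
                out[n] = float("inf")  # unseen distributions are tried first
            else:
                bonus = self.exploration * math.sqrt(2 * math.log(total) / self.counts[n])
                out[n] = self.mean_abs_adv[n] + bonus
        return out

    def probabilities(self, temperature=1.0):
        """Softmax over UCB scores (assumed); unseen distributions share all the mass."""
        s = self.scores()
        unseen = [n for n in self.names if math.isinf(s[n])]
        if unseen:
            return {n: (1.0 / len(unseen) if n in unseen else 0.0) for n in self.names}
        m = max(s.values())
        w = {n: math.exp((s[n] - m) / temperature) for n in self.names}
        z = sum(w.values())
        return {n: w[n] / z for n in self.names}

    def sample(self, rng=random):
        """Draw the distribution to train on next."""
        p = self.probabilities()
        return rng.choices(self.names, weights=[p[n] for n in self.names], k=1)[0]
```

In this toy version, a distribution with large advantages and few samples receives the highest sampling probability, which matches the exploitation/exploration trade-off described in the abstract.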

Authors:Jixiao Zhang, Chunsheng Zuo
Title: GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models
Abstract:
Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
中文摘要:GRPO-LEAD通过引入长度正则化奖励、显式错误惩罚和难度感知优势重加权,显著提升了数学推理的准确性与简洁性,在140亿参数模型中实现了最优性能。
English Summary: GRPO-LEAD enhances mathematical reasoning by introducing length-regularized rewards, explicit error penalties, and difficulty-aware advantage reweighting, achieving state-of-the-art performance in accuracy and conciseness for 14B-scale models.
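Illustrative sketch (not the authors' code): the three ingredients named in the abstract can be combined in a toy group-relative advantage computation as below. The exact functional forms are assumptions: the exponential length term, the fixed `wrong_penalty`, and the linear `difficulty_weight` are hypothetical stand-ins, since the abstract does not give formulas.

```python
import math
from statistics import mean, pstdev


def shaped_rewards(correct, lengths, alpha=0.1, wrong_penalty=-1.0):
    """Correct answers earn exp(-alpha * z), where z is the length z-score among
    correct answers (shorter is better); wrong answers get a fixed penalty."""
    ok_lengths = [n for c, n in zip(correct, lengths) if c]
    mu = mean(ok_lengths) if ok_lengths else 0.0
    sigma = pstdev(ok_lengths) if len(ok_lengths) > 1 else 0.0
    rewards = []
    for c, n in zip(correct, lengths):
        if not c:
            rewards.append(wrong_penalty)
        else:
            z = (n - mu) / sigma if sigma > 0 else 0.0
            rewards.append(math.exp(-alpha * z))
    return rewards


def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalisation: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_weight(correct, w_min=0.5, w_max=1.5):
    """Hypothetical linear weight: prompts with a low pass rate get a larger weight."""
    pass_rate = sum(correct) / len(correct)
    return w_max - (w_max - w_min) * pass_rate


def lead_style_advantages(correct, lengths):
    """Shape rewards, normalise within the group, then scale by difficulty."""
    adv = group_advantages(shaped_rewards(correct, lengths))
    w = difficulty_weight(correct)
    return [w * a for a in adv]
```

For a group of four sampled answers where three are correct, the shortest correct answer receives the largest advantage and the incorrect answer the smallest.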

Authors:Jiahao Qiu, Yinghui He, Xinzhe Juan, Yimin Wang, Yuhan Liu, Zixin Yao, Yue Wu, Xun Jiang, Ling Yang, Mengdi Wang
Title: EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Abstract:
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent
中文摘要:EmoAgent框架通过EmoEval组件评估AI交互导致的心理健康风险,并利用EmoGuard实时监测干预;实验表明,在超过34.4%的模拟中脆弱用户出现心理状态恶化,而EmoGuard能显著降低这一恶化率。
English Summary: The EmoAgent framework addresses mental health risks in human-AI interactions by using EmoEval to assess psychological deterioration through clinical tools and EmoGuard to monitor and mitigate harm, significantly reducing deterioration rates in vulnerable users.

Authors:Yao Yuan, Pan Gao, Qun Dai, Jie Qin, Wei Xiang
Title: Uncertainty Guided Refinement for Fine-Grained Salient Object Detection
Abstract:
Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at https://github.com/I2-Multimedia-Lab/UGRAN.
Chinese: 本文提出不确定性引导细化注意力网络(UGRAN),通过多级交互注意力、尺度空间一致性注意力和不确定性细化注意力模块,结合自适应动态分区机制,有效提升模型对不确定区域的感知能力,在多个基准数据集上实现了优于现有方法的显著目标检测性能。
English: This paper introduces the Uncertainty Guided Refinement Attention Network (UGRAN) to address issues of unsaturated regions and shadows in salient object detection by enhancing the model's perception of uncertain regions through innovative modules and an adaptive dynamic partition mechanism, demonstrating superior performance on benchmark datasets.

Authors:Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Abstract:
Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.
中文: KeyVID是一种音频到视频的框架,它首先从音频中定位关键帧时间点生成视觉关键帧,再通过插值补充中间帧,在保持计算效率的同时显著提升了动态动作的同步性和视频质量。
English: KeyVID is an audio-to-video framework that first identifies key moments from audio to generate visual keyframes, then interpolates intermediate frames, enhancing synchronization and quality for dynamic motions while maintaining computational efficiency.
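Illustrative sketch (not the authors' code): the three-stage flow in the abstract (localize keyframe time steps from audio, produce keyframes, fill in intermediate frames) can be mimicked with simple stand-ins. Here peak-picking on an audio energy envelope replaces the learned keyframe localizer, and linear blending replaces the motion interpolator; both substitutions, and the function names, are assumptions made purely to show the data flow.

```python
def localize_keyframes(energy, k):
    """Return sorted frame indices: both endpoints plus the strongest local energy peaks,
    for at most k indices in total (k >= 2)."""
    n = len(energy)
    peaks = [i for i in range(1, n - 1)
             if energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]]
    peaks.sort(key=lambda i: energy[i], reverse=True)
    chosen = {0, n - 1} | set(peaks[:max(0, k - 2)])
    return sorted(chosen)


def interpolate_frames(key_idx, key_frames, n):
    """Linearly blend neighbouring keyframes (lists of floats) to fill all n frames."""
    frames = [None] * n
    pairs = list(zip(key_idx, key_frames))
    for (i0, f0), (i1, f1) in zip(pairs, pairs[1:]):
        for t in range(i0, i1 + 1):
            w = (t - i0) / (i1 - i0)
            frames[t] = [(1 - w) * a + w * b for a, b in zip(f0, f1)]
    return frames
```

Because keyframes sit on audio peaks rather than on a uniform grid, frames near salient audio events are generated directly instead of being interpolated, which is the motivation the abstract gives for keyframe-aware sampling.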

Authors:Avinash Patil
Title: GitBugs: Bug Reports for Duplicate Detection, Retrieval Augmented Generation, Triage, and More
Abstract:
Bug reports provide critical insights into software quality, yet existing datasets often suffer from limited scope, outdated content, or insufficient metadata for machine learning. To address these limitations, we present GitBugs, a comprehensive and up-to-date dataset comprising over 150,000 bug reports from nine actively maintained open-source projects, including Firefox, Cassandra, and VS Code. GitBugs aggregates data from GitHub, Bugzilla and Jira issue trackers, offering standardized categorical fields for classification tasks and predefined train/test splits for duplicate bug detection. In addition, it includes exploratory analysis notebooks and detailed project-level statistics, such as duplicate rates and resolution times. GitBugs supports various software engineering research tasks, including duplicate detection, retrieval augmented generation, resolution prediction, automated triaging, and temporal analysis. The openly licensed dataset provides a valuable cross-project resource for benchmarking and advancing automated bug report analysis. Access the data and code at https://github.com/av9ash/gitbugs/.
中文: GitBugs是一个包含超过15万份错误报告的全面、最新数据集,汇集了九个活跃开源项目的数据,通过提供标准化元数据和预定义训练/测试分割,为软件工程研究提供了跨项目的基准分析资源。
English: GitBugs is a comprehensive, up-to-date dataset of over 150,000 bug reports from nine active open-source projects, designed to overcome limitations of existing datasets by providing standardized metadata and supporting various software engineering research tasks.

Authors:Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao
Title: SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Abstract:
Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.
中文摘要:本文提出地理空间像素推理新任务,通过构建EarthReason基准数据集和SegEarth-R1模型,将分层视觉编码与大语言模型相结合,在遥感图像隐式查询推理任务中实现了最先进的性能表现。
English Summary: This paper introduces geospatial pixel reasoning, a novel task for implicit querying and reasoning in remote sensing, supported by the EarthReason dataset and the SegEarth-R1 model that achieves state-of-the-art performance by integrating hierarchical vision encoding with large language models.

Authors:Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
Title: TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Abstract:
Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
中文:近期强化学习提升了大型多模态模型的推理能力,但针对计算资源有限的研究者,我们提出的小规模视频推理模型TinyLLaVA-Video-R1在通用视频问答数据集上不仅显著增强了推理和思考能力,还展现出“顿悟时刻”的新兴特性,为小模型视频推理探索提供了实用参考。
English: Recent advances in reinforcement learning have enhanced large multimodal models' reasoning, yet small-scale models like TinyLLaVA-Video-R1, with under 4B parameters, show significant reasoning improvements and emergent "aha moments" on general Video-QA datasets, offering practical insights for resource-limited research.

Authors:Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju, Sougata Sen, Sanjay E. Sarma, Archan Misra
Title: Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Abstract:
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.
中文: 本研究提出了ImputeRefer基准数据集和Ges3ViG模型,通过将人类指向手势融入3D场景分析,在三维具身参照理解任务中显著超越了现有方法的性能表现。
English: The study introduces ImputeRefer, a novel benchmark dataset, and Ges3ViG, a model for 3D Embodied Reference Understanding that significantly outperforms existing methods by incorporating human pointing gestures into 3D scene analysis.

Authors:Jiuchen Chen, Xinyu Yan, Qizhi Xu, Kaiqi Li
Title: Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
Abstract:
Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 $\times$ 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 $\times$ 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at https://github.com/CastleChen339/DehazeXL.
中文: DehazeXL是一种新颖的去雾方法,能有效平衡全局上下文与局部特征,以最小GPU内存实现大尺寸高分辨率图像的端到端处理并达到最优性能,同时提供了新的8KDehaze数据集支持。
English: DehazeXL is a novel haze removal method that effectively balances global context and local features, enabling end-to-end processing of large high-resolution images with minimal GPU memory while achieving state-of-the-art performance, supported by a new 8KDehaze dataset.

Authors:Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan Lin
Title: Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
Abstract:
Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at https://github.com/GATECH-EIC/Early-Bird-Diffusion.
中文:EB-Diff-Train方法通过利用基于时间步重要性的自适应稀疏性早期鸟彩票,在不损失生成质量的前提下显著加速扩散模型训练。
English: The EB-Diff-Train method accelerates diffusion model training by utilizing early-bird tickets that adapt sparsity based on timestep importance, achieving significant speedups without quality loss.

Authors:Zhehao Dong, Zhen Lu, Yue Yang
Title: Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations
Abstract:
Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows. Our code and fine-tuned model have been deposited at https://github.com/YYgroup/AutoCFD.
中文: 本研究通过领域特定的微调大语言模型方法,实现了从自然语言到CFD仿真的自动化配置,在显著超越通用大模型的同时保持了高精度与高效率。
English: This study introduces a domain-specific fine-tuned LLM approach that automates CFD simulation setup through natural language translation, achieving state-of-the-art accuracy and efficiency while outperforming larger general-purpose models.
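Illustrative sketch (not the authors' code): the generate, run, and correct cycle that the abstract attributes to the multi-agent framework can be outlined as a bounded retry loop. The function name and the three callables (`generate`, `simulate`, `correct`) are hypothetical placeholders for the fine-tuned LLM, the simulation runner, and the error-correction agent; none of them correspond to a real API of the released code.

```python
def run_with_correction(description, generate, simulate, correct, max_iters=3):
    """Generate a config, run it, and request a corrected config while errors persist.

    generate(description) -> config
    simulate(config)      -> None on success, or an error message string
    correct(config, err)  -> revised config
    Returns (config, attempts, succeeded).
    """
    config = generate(description)
    for attempt in range(1, max_iters + 1):
        error = simulate(config)
        if error is None:
            return config, attempt, True
        if attempt < max_iters:
            config = correct(config, error)  # feed the error back for a revised setup
    return config, max_iters, False
```

Under this framing, the first-attempt success rate reported in the abstract corresponds to runs that return with `attempts == 1`, and fewer correction iterations means a smaller average `attempts`.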

Authors:Zan Huang
Title: Revisiting Self-Attentive Sequential Recommendation
Abstract:
Recommender systems are ubiquitous in online services, where they drive business. Many sequential recommender models have been deployed in these systems to enhance personalization. Using a transformer decoder as the sequential recommender was proposed years ago and remains a strong baseline in recent work. However, this kind of sequential recommender has not scaled up as well as language models. Many details of the classical self-attentive sequential recommender model can be revisited, and new experiments may lead to new findings without changing the general model structure, which was the focus of many previous works. In this paper, we present these details and propose new experimental methodologies for future research on sequential recommendation, in the hope of motivating further exploration in this area.
中文摘要:本文重新审视经典的自注意力序列推荐模型,在不改变核心结构的前提下提出新的实验方法,旨在挖掘新发现并推动该领域的进一步探索。
English Summary: This paper revisits classical self-attentive sequential recommender models, proposing new experimental methodologies to uncover fresh insights without altering the core structure, aiming to inspire further advancements in the field.

Authors:Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
Abstract:
Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT), a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers", "Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoT standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.
Chinese: 受极小自由分解启发,Syzygy of Thoughts (SoT) 通过引入相互关联的推理路径扩展了思维链提示,在多个数据集和模型上实现了更稳健的问题解决能力和更高的推理准确率。
English: Syzygy of Thoughts (SoT) extends Chain-of-Thought prompting by introducing interrelated reasoning paths inspired by Minimal Free Resolution, enabling more robust problem-solving and improved inference accuracy across diverse datasets and models.

Authors:Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Title: EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
Abstract:
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.
Chinese Summary: EmbodiedOcc++通过引入几何引导优化模块实现平面结构对齐,并结合语义感知不确定性采样器优化高斯更新,在保持计算效率的同时显著提升了三维占据预测的几何细节与边缘精度。
English Summary: EmbodiedOcc++ enhances 3D occupancy prediction by incorporating a Geometry-guided Refinement Module for planar surface alignment and a Semantic-aware Uncertainty Sampler for optimized Gaussian updates, achieving state-of-the-art performance with improved geometric accuracy.

Authors:Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
Title: Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders
Abstract:
Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at https://github.com/shuchaoduan/TraMP-Former.
中文:TraMP-Former框架通过融合面部关键点轨迹特征与视觉语义线索,在神经障碍数据集上实现了最先进的性能,推动了自动化面部表情质量评估的发展。
English: The TraMP-Former framework advances automated facial expression quality assessment by integrating landmark trajectory features and visual semantic cues, achieving state-of-the-art performance on neurological disorder datasets.

Authors:Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
Title: 3D CoCa: Contrastive Learners are 3D Captioners
Abstract:
3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms the current state of the art by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.
中文: 提出的3D CoCa框架将对比学习与描述生成整合到统一架构中,无需外部检测器即可增强空间推理和语义对齐,在基准测试中实现了最先进的性能。
English: The proposed 3D CoCa framework integrates contrastive learning and caption generation in a unified architecture, achieving state-of-the-art performance on benchmarks by enhancing spatial reasoning and semantic alignment without external detectors.

Authors:Sharanya Dasgupta, Sujoy Nath, Arkaprabha Basu, Pourya Shamsolmoali, Swagatam Das
Title: HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs
Abstract:
Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at https://github.com/sharanya-dasgupta001/hallushift.
Chinese: 本研究提出HalluShift方法,通过分析大语言模型内部状态空间和标记概率的分布偏移来检测其产生的幻觉,在多个基准测试中表现优于现有基线。
English: This study introduces HalluShift, a novel method that detects hallucinations in Large Language Models by analyzing shifts in their internal state space and token probabilities, achieving superior performance across multiple benchmarks.

Authors:Chenbin Zhang, Zhiqiang Hu, Chuchu Jiang, Wen Chen, Jie Xu, Shaoting Zhang
Title: Rethinking the generalization of drug target affinity prediction algorithms via similarity aware evaluation
Abstract:
Drug-target binding affinity prediction is a fundamental task for drug discovery. It has been extensively explored in the literature, and promising results have been reported. However, in this paper, we demonstrate that these results may be misleading and may not generalize well to real practice. The core observation is that the canonical randomized split used in conventional evaluation leaves the test set dominated by samples with high similarity to the training set. Model performance degrades severely on samples with lower similarity to the training set, but this drawback is largely overlooked in current evaluation. As a result, the performance can hardly be trusted when the model meets low-similarity samples in real practice. To address this problem, we propose a framework of similarity aware evaluation in which a novel split methodology is proposed to adapt to any desired distribution. This is achieved by formulating optimization problems that are approximately and efficiently solved by gradient descent. We perform extensive experiments across five representative methods in four datasets for two typical target evaluations and compare them with various counterpart methods. Results demonstrate that the proposed split methodology fits desired distributions significantly better and can guide the development of models. Code is released at https://github.com/Amshoreline/SAE/tree/main.
中文: 本文指出传统药物靶点结合力评估因测试集与训练集高度相似而产生误导性结果,并提出一种基于优化分割的相似性感知评估框架,能更准确地反映模型在真实场景中的性能。
English: This paper reveals that conventional drug-target binding affinity evaluations are misleading due to test sets dominated by high-similarity samples, and proposes a similarity-aware framework with optimized data splitting to better reflect real-world performance.
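
The core observation in this abstract (a random split leaves the test set dominated by samples that closely resemble training samples) can be checked with a short diagnostic. The sketch below is only an illustration and not the authors' split methodology: it assumes each sample is represented as a set of integer feature indices (for example, fingerprint bits) and uses Tanimoto similarity, neither of which is stated in the abstract. The function names are hypothetical.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_train_similarity(test: List[Set[int]], train: List[Set[int]]) -> List[float]:
    """For each test sample, the similarity to its nearest training sample."""
    return [max(tanimoto(t, tr) for tr in train) for t in test]


def bin_counts(values: List[float], n_bins: int = 4) -> List[int]:
    """Count values in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


# Toy data (assumed, for illustration only).
train = [{1, 2, 3, 4}, {5, 6, 7, 8}]
test = [{1, 2, 3, 4}, {1, 2, 3, 9}, {5, 10, 11, 12}]

sims = max_train_similarity(test, train)
print(sims)
print(bin_counts(sims))
```

A histogram that piles up in the highest bins would indicate the kind of similarity-dominated test set the abstract warns about.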

Authors:Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang
Title: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Abstract:
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present a systematic review of VLM-based detection and segmentation, view the VLM as the foundational model, and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance across varied tasks. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work will be valuable to pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem and familiarizing them with the current state of progress, while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.
Chinese: 本研究首次对视觉语言模型在多种检测与分割任务中进行系统性评估,揭示了其性能特点,并为未来模型设计与训练策略提供了重要见解。
English: This study conducts the first comprehensive evaluation of Vision-Language Models (VLMs) across multiple detection and segmentation tasks, revealing their performance characteristics and providing insights into model design and training strategies.

Authors:Sacheendra Talluri, Dante Niewenhuis, Xiaoyu Chu, Jakob Kyselica, Mehmet Cetin, Alexander Balgavy, Alexandru Iosup
Title: Cloud Uptime Archive: Open-Access Availability Data of Web, Cloud, and Gaming Services
Abstract:
Cloud services are critical to society. However, their reliability is poorly understood. Towards solving the problem, we propose a standard repository for cloud uptime data. We populate this repository with the data we collect containing failure reports from users and operators of cloud services, web services, and online games. The multiple vantage points help reduce bias from individual users and operators. We compare our new data to existing failure data from the Failure Trace Archive and the Google cluster trace. We analyze the MTBF and MTTR, time patterns, failure severity, user-reported symptoms, and operator-reported symptoms of failures in the data we collect. We observe that high-level user facing services fail less often than low-level infrastructure services, likely due to them using fault-tolerance techniques. We use simulation-based experiments to demonstrate the impact of different failure traces on the performance of checkpointing and retry mechanisms. We release the data, and the analysis and simulation tools, as open-source artifacts available at https://github.com/atlarge-research/cloud-uptime-archive.
中文摘要:本研究建立了一个标准化的云服务运行时间数据存储库,通过分析故障模式并利用模拟实验展示不同故障轨迹对检查点和重试机制性能的影响,所有数据与工具均已开源发布。
English Summary: This study introduces a standardized repository for cloud uptime data, analyzing failure patterns and demonstrating through simulations how different failure traces affect checkpointing and retry mechanisms, with all resources made publicly available.
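
The abstract mentions MTBF (mean time between failures) and MTTR (mean time to repair) without defining them. The following is a minimal sketch of one common way to compute both from a list of outage intervals. It does not use the archive's actual data format or tooling: the `(start, end)` tuple layout, the hour unit, and the function name are assumptions for illustration, and MTBF is taken here as total uptime divided by the number of failures.

```python
from typing import List, Tuple


def mtbf_mttr(outages: List[Tuple[float, float]], window_hours: float) -> Tuple[float, float]:
    """Return (MTBF, MTTR) in hours.

    Assumes non-overlapping (start, end) outages that all lie inside a
    single observation window of length window_hours.
    """
    if not outages:
        raise ValueError("need at least one outage to compute MTBF and MTTR")
    downtime = sum(end - start for start, end in outages)
    uptime = window_hours - downtime
    n = len(outages)
    return uptime / n, downtime / n


# Three hypothetical outages observed over a 30-day (720 h) window.
outages = [(10.0, 12.0), (100.0, 101.0), (400.0, 403.0)]
mtbf, mttr = mtbf_mttr(outages, window_hours=720.0)
print(mtbf, mttr)  # 238.0 2.0
```

Other conventions exist (for example, measuring MTBF between consecutive failure start times), so the definition should be fixed before comparing numbers across traces.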

Authors:Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
Title: D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation
Abstract:
Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at the second stage generates images by predicting multi-grained noise, consisting of coarse-grained (fewer latent codes in smooth regions) and fine-grained (more latent codes in detailed regions) noise, through a novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.
中文摘要:本文提出了一种新颖的两阶段框架,通过动态变分自编码器和动态扩散变换器根据图像区域的信息密度进行自适应压缩,利用多粒度噪声预测实现了生成图像全局一致性与局部真实性的统一提升。
English Summary: This paper introduces a novel two-stage framework with Dynamic VAE and Dynamic Diffusion Transformer that dynamically compresses image regions based on their information density, achieving enhanced global consistency and local realism in generated images through multi-grained noise prediction.

Authors:Lin Zhu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Title: Bayesian Cross-Modal Alignment Learning for Few-Shot Out-of-Distribution Generalization
Abstract:
Recent advances in large pre-trained models have shown promising results in few-shot learning. However, their generalization ability on two-dimensional Out-of-Distribution (OoD) data, i.e., correlation shift and diversity shift, has not been thoroughly investigated. Research has shown that even with a significant amount of training data, few methods can achieve better performance than the standard empirical risk minimization method (ERM) in OoD generalization. This few-shot OoD generalization dilemma emerges as a challenging direction in deep neural network generalization research, where the performance suffers from overfitting on few-shot examples and OoD generalization errors. In this paper, leveraging a broader supervision source, we explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this issue. Specifically, the model is designed so that only text representations are fine-tuned via a Bayesian modelling approach with gradient orthogonalization loss and invariant risk minimization (IRM) loss. The Bayesian approach is essentially introduced to avoid overfitting the base classes observed during training and improve generalization to broader unseen classes. The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-causal parts of image features. Numerical experiments demonstrate that Bayes-CAL achieves state-of-the-art OoD generalization performance on two-dimensional distribution shifts. Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performance on unseen classes. Our code is available at https://github.com/LinLLLL/BayesCAL.
中文: 本文提出的贝叶斯跨模态对齐方法Bayes-CAL通过专门设计的损失函数微调文本表征,有效解决了小样本分布外泛化问题,在未见类别上展现出更稳定的性能表现。
English: This paper introduces Bayes-CAL, a Bayesian cross-modal alignment method that addresses few-shot OoD generalization by fine-tuning text representations with specialized losses to prevent overfitting and improve stability on unseen classes.

Authors:Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
Title: ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
Abstract:
Recent advances in reasoning with large language models (LLMs) have shown remarkable capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.
Chinese: ClinicalGPT-R1作为基于2万份临床记录训练的推理增强大语言模型,在中文诊断任务中超越GPT-4o,英文场景与GPT-4表现相当,其卓越诊断能力已通过MedBench-Hard基准验证。
English: ClinicalGPT-R1, a reasoning-enhanced LLM trained on 20,000 clinical records, surpasses GPT-4o in Chinese diagnostic tasks and matches GPT-4 in English, as validated on the challenging MedBench-Hard dataset.

Authors:Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin
Title: Gradient as Conditions: Rethinking HOG for All-in-one Image Restoration
Abstract:
All-in-one image restoration (AIR) aims to address diverse degradations within a unified model by leveraging informative degradation conditions to guide the restoration process. However, existing methods often rely on implicitly learned priors, which may entangle feature representations and hinder performance in complex or unseen scenarios. We observe that Histogram of Oriented Gradients (HOG), a classical gradient representation, has strong discriminative capability across diverse degradations, making it a powerful and interpretable prior for AIR. Based on this insight, we propose HOGformer, a Transformer-based model that integrates learnable HOG features for degradation-aware restoration. The core of HOGformer is a Dynamic HOG-aware Self-Attention (DHOGSA) mechanism, which adaptively models long-range spatial dependencies conditioned on degradation-specific cues encoded by HOG descriptors. To further adapt to the heterogeneity of degradations in AIR, we propose a Dynamic Interaction Feed-Forward (DIFF) module that facilitates channel-spatial interactions, enabling robust feature transformation under diverse degradations. In addition, we propose a HOG loss to explicitly enhance structural fidelity and edge sharpness. Extensive experiments on a variety of benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes well to complex real-world scenarios. Code is available at https://github.com/Fire-friend/HOGformer.
Chinese: HOGformer是一种基于Transformer的模型,它利用方向梯度直方图(HOG)作为可解释的先验知识进行一体化图像恢复,通过动态HOG感知自注意力机制和自适应特征交互模块,实现了最先进的性能表现。
English: HOGformer is a Transformer-based model that utilizes Histogram of Oriented Gradients (HOG) as an interpretable prior for all-in-one image restoration, achieving state-of-the-art performance by integrating dynamic HOG-aware self-attention and adaptive feature interaction modules.
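
For readers unfamiliar with the classical descriptor this abstract builds on, the sketch below computes a Histogram of Oriented Gradients for a single image cell. It covers only a simplified classical HOG and not the learnable HOG features or the DHOGSA mechanism of HOGformer. As simplifying assumptions, it uses central-difference gradients, hard assignment to 9 unsigned orientation bins, and L1 normalization, whereas the standard descriptor also interpolates between bins and normalizes over blocks of cells. The function name is hypothetical.

```python
import math
from typing import List


def hog_cell(img: List[List[float]], bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the border so that central differences are always defined.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Fold the signed angle into [0, 180) for unsigned orientation.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += mag
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


# Intensity ramps: one varying along x, one varying along y.
ramp_x = [[float(x) for x in range(5)] for _ in range(5)]
ramp_y = [[float(y) for _ in range(5)] for y in range(5)]

print(hog_cell(ramp_x))  # all weight in bin 0 (gradient direction 0 degrees)
print(hog_cell(ramp_y))  # all weight in bin 4 (gradient direction 90 degrees)
```

The two ramps produce clearly different histograms, a small instance of the discriminative behaviour that the abstract attributes to HOG across degradation types.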

Authors:Iason Chaimalas, Arnas Vyšniauskas, Gabriel Brostow
Title: Explorer: Robust Collection of Interactable GUI Elements
Abstract:
Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data collection needed to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.
Chinese: Explorer系统通过专注于检测按钮和文本字段等交互元素,利用实时应用数据训练机器学习模型,实现个性化的图形用户界面自动化,从而提供精确的用户特定可访问性,并通过语音命令进行路径规划。
English: The Explorer system enables personalized automation of graphical user interfaces by focusing on detecting interactive elements like buttons and text fields, using live application data to train machine learning models for precise, user-specific accessibility and path planning through audio commands.

Authors:Simon Adamov, Joel Oskarsson, Leif Denby, Tomas Landelius, Kasper Hintz, Simon Christiansen, Irene Schicker, Carlos Osuna, Fredrik Lindsten, Oliver Fuhrer, Sebastian Schemm
Title: Building Machine Learning Limited Area Models: Kilometer-Scale Weather Forecasting in Realistic Settings
Abstract:
Machine learning is revolutionizing global weather forecasting, with models that efficiently produce highly accurate forecasts. Apart from global forecasting there is also a large value in high-resolution regional weather forecasts, focusing on accurate simulations of the atmosphere for a limited area. Initial attempts have been made to use machine learning for such limited area scenarios, but these experiments do not consider realistic forecasting settings and do not investigate the many design choices involved. We present a framework for building kilometer-scale machine learning limited area models with boundary conditions imposed through a flexible boundary forcing method. This enables boundary conditions defined either from reanalysis or operational forecast data. Our approach employs specialized graph constructions with rectangular and triangular meshes, along with multi-step rollout training strategies to improve temporal consistency. We perform systematic evaluation of different design choices, including the boundary width, graph construction and boundary forcing integration. Models are evaluated across both a Danish and a Swiss domain, two regions that exhibit different orographical characteristics. Verification is performed against both gridded analysis data and in-situ observations, including a case study for the storm Ciara in February 2020. Both models achieve skillful predictions across a wide range of variables, with our Swiss model outperforming the numerical weather prediction baseline for key surface variables. With their substantially lower computational cost, our findings demonstrate great potential for machine learning limited area models in the future of regional weather forecasting.
中文: 机器学习通过采用灵活边界条件和专业图结构的新框架,推动了高分辨率区域天气预报的发展,实现了在降低计算成本的同时获得精准预测的能力。
English: Machine learning is advancing regional weather forecasting through a novel framework that enables high-resolution, kilometer-scale models with flexible boundary conditions and specialized graph constructions, achieving skillful predictions with lower computational costs.

Authors:Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai
Title: FVQ: A Large-Scale Dataset and an LMM-based Method for Face Video Quality Assessment
Abstract:
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face videos, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
中文: 本研究提出了首个大规模人脸视频质量评估数据集FVQ-20K,并开发了FVQ-Rater方法,通过融合多模态特征和指令微调技术实现类人化质量评分,为推进人脸视频质量评估领域发展展现出重要潜力。
English: This study introduces FVQ-20K, the first large-scale dataset for face video quality assessment (FVQA), and proposes FVQ-Rater, a novel method leveraging multimodal features and instruction tuning to achieve human-like quality evaluation, demonstrating significant potential for advancing FVQA research.

Authors:Yomna Mokhtar, Tarek Shohdy, Abdallah A. Hassan, Mostafa Eshra, Omar Elmenawy, Osama Khalil, Haitham El-Hussieny
Title: Development of a PPO-Reinforcement Learned Walking Tripedal Soft-Legged Robot using SOFA
Abstract:
Rigid robots have been extensively researched, whereas soft robotics remains an underexplored field. Using soft-legged robots to perform tasks in place of human beings is an important stride, especially under harsh and hazardous conditions over rough terrain. To teach any robot how to behave in different scenarios, a real-time physical and visual simulation is essential. For soft robots specifically, building a simulation framework remains an arduous problem that needs to be addressed. Using the Simulation Open Framework Architecture (SOFA) is an advantageous step. However, neither SOFA's manual nor prior public SOFA projects show the full capabilities that users can reach. We resolved this by establishing customized settings and handling the framework components appropriately. Settling on fine-tuned SOFA parameters motivated us to implement the state-of-the-art (SOTA) reinforcement learning (RL) method of proximal policy optimization (PPO). The final result is a well-defined, ready-to-deploy walking, tripedal, soft-legged robot based on PPO-RL in a SOFA environment. Robot navigation performance is a key metric for measuring success. Although an 82% success rate in reaching a single goal is already a groundbreaking result for simulated soft robots, we went further by evaluating performance when a sequence of goals is assigned. While tracking the platform steps, strong performance was observed, with an accumulative squared error deviation of 19 mm. The full code is publicly available at https://github.com/tarekshohdy/PPO_SOFA_Soft_Legged_Robot.git
Chinese: 本研究利用SOFA仿真框架和PPO强化学习开发了一款软体三足机器人,在单目标导航中达到82%成功率,多目标追踪误差仅19毫米,相关代码已开源。
English: This study develops a soft-legged tripedal robot using the SOFA simulation framework and PPO reinforcement learning, achieving an 82% success rate in single-goal navigation and a 19 mm deviation in multi-goal tasks, with code publicly released.

Authors:You Wu, Xucheng Wang, Xiangyang Yang, Mengyuan Liu, Dan Zeng, Hengzhou Ye, Shuiwang Li
Title: Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
Abstract:
Single-stream architectures using Vision Transformer (ViT) backbones have recently shown great potential for real-time UAV tracking. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. This random masking is intended to approximately simulate target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Code is available at https://github.com/wuyou3474/ORTrack.
中文: 本研究提出ORTrack框架,通过基于空间Cox过程的随机掩码学习抗遮挡特征表示,以增强单流视觉Transformer模型在无人机跟踪中的性能,并利用自适应知识蒸馏开发出高效的实时版本ORTrack-D。
English: This study introduces ORTrack, a novel framework that enhances single-stream Vision Transformer models for UAV tracking by learning occlusion-robust representations through spatial Cox process-based random masking, and develops ORTrack-D via adaptive knowledge distillation for efficient real-time performance.

Authors:Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, Nikos Deligiannis
Title: ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking
Abstract:
Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The codes are available on https://github.com/Tzoulio/ReferGPT
中文: ReferGPT是一种零样本多目标跟踪框架,通过具备空间知识的多模态大语言模型生成三维感知描述,并采用基于CLIP的语义匹配策略,在无需训练的情况下于自动驾驶基准测试中展现出竞争优势。
English: ReferGPT is a zero-shot framework that uses a multi-modal large language model with spatial knowledge to generate 3D-aware captions and employs a CLIP-based matching strategy for robust object tracking without training, achieving competitive performance in autonomous driving benchmarks.

Authors:Shengyu Gong, Yueyang Li, Zijian Kang, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
Title: LEL: A Novel Lipschitz Continuity-constrained Ensemble Learning Model for EEG-based Emotion Recognition
Abstract:
The accurate and efficient recognition of emotional states in oneself and others is critical, as impairments in this ability can lead to significant psychosocial difficulties. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we introduce LEL (Lipschitz continuity-constrained Ensemble Learning), a novel framework that enhances EEG-based emotion recognition. By integrating Lipschitz continuity constraints, LEL ensures greater model stability and improves generalization, thereby reducing sensitivity to signal variability and noise while significantly boosting the model's overall accuracy and robustness. Its ensemble learning strategy optimizes overall performance by fusing decisions from multiple classifiers to reduce single-model bias and variance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrate LEL's state-of-the-art performance, with average recognition accuracies of 76.43%, 83.00% and 87.22%, respectively. The official implementation code is released at https://github.com/NZWANG/LEL.
Chinese: LEL框架通过整合Lipschitz连续性约束和集成学习,显著提升了基于脑电图的情绪识别性能,在多个基准数据集上实现了最优的准确性和鲁棒性。
English: The LEL framework enhances EEG-based emotion recognition by integrating Lipschitz continuity constraints and ensemble learning, achieving state-of-the-art accuracy and robustness on benchmark datasets.

Authors:Yunfei Long, Abhinav Kumar, Xiaoming Liu, Daniel Morris
Title: RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection
Abstract:
Radar hits reflect from points both on the boundary of and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.
Chinese: 本文提出一种方法,通过显式建模雷达命中分布来改进雷达-相机融合,利用预测核与上下文融合优化匹配得分,在nuScenes数据集上实现了最先进的检测性能。
English: This paper introduces a method that explicitly models radar hit distributions to enhance radar-camera fusion, achieving state-of-the-art detection performance on nuScenes by refining matching scores through predicted kernels and contextual fusion.

Authors:Matt Grenander, Siddharth Varia, Paula Czarnowska, Yogarshi Vyas, Kishaloy Halder, Bonan Min
Title: Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models
Abstract:
Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long-document narrative tasks. The length and complexity of narrative texts often make them difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided summaries are often well grounded to their plan, plans are as likely to contain hallucinations as summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale for plan-guided approaches to summarization, especially for long, complex domains such as narrative texts. Code available at https://github.com/amazon-science/plan-guided-summarization
Chinese: 计划引导的摘要方法在小型语言模型中未能显著提升长篇叙事文本摘要的质量或忠实度,因为计划本身同样容易出现虚构内容,导致该方法效果不佳。
English: Plan-guided summarization in small language models does not significantly enhance the quality or faithfulness of summaries for long narrative texts, as plans themselves are prone to hallucinations, rendering the approach ineffective.

Authors:Zhijie Shen, Chunyu Lin, Shujuan Huang, Lang Nie, Kang Liao, Yao Zhao
Title: You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting
Abstract:
Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a $360^\circ \times 180^\circ$ field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.
中文摘要:TPGS框架通过引入过渡平面和优化策略,解决了全景3D高斯溅射中的失真和边界问题,实现了更优的场景重建效果。
English Summary: The TPGS framework addresses distortion and boundary issues in panoramic 3D Gaussian splatting by introducing a transition plane and optimization strategy to achieve superior scene reconstruction.

Authors:Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
Title: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Abstract:
Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs across various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods (fuzzy string matching, sentence embedding, and subspace similarity) to provide an overarching assessment of a model's performance. The models' results are compared against those of human evaluators on the same task. Our metric reveals that, regardless of prompt engineering, the leading models (ChatGPT, Claude, and DeepSeek) achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans, who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.
中文: 本研究提出了一种新颖的幽默检测指标,用于评估大语言模型从单口喜剧文本中识别幽默笑点的能力,结果显示顶尖模型最高达到51%的准确率——超过人类评估者的41%——同时揭示了幽默提取的主观性与复杂性。
English: This study introduces a novel humor detection metric to evaluate large language models' ability to identify humorous punchlines from stand-up comedy transcripts, revealing that top models achieve up to 51% accuracy—surpassing human evaluators' 41%—while highlighting the subjectivity and complexity of humor extraction.
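The fuzzy string matching component of the metric described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.8 threshold, and the choice of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_match_score(predicted, reference, threshold=0.8):
    """Fraction of reference punchlines matched by at least one predicted quote.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    if not reference:
        return 0.0
    hits = 0
    for ref in reference:
        # A reference line counts as found if any predicted quote is close enough.
        if any(fuzzy_ratio(ref, pred) >= threshold for pred in predicted):
            hits += 1
    return hits / len(reference)
```

Calling `fuzzy_match_score(model_quotes, annotated_punchlines)` gives a recall-style score: a model that extracts half of the annotated punchlines, up to small transcription differences, scores 0.5.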

Authors:Yongchang Wu, Zipeng Qi, Zhenwei Shi, Zhengxia Zou
Title: BlockGaussian: Efficient Large-Scale Scene Novel View Synthesis via Adaptive Block-Based Gaussian Splatting
Abstract:
The recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated remarkable potential in novel view synthesis tasks. The divide-and-conquer paradigm has enabled large-scale scene reconstruction, but significant challenges remain in scene partitioning, optimization, and merging processes. This paper introduces BlockGaussian, a novel framework incorporating a content-aware scene partition strategy and visibility-aware block optimization to achieve efficient and high-quality large-scale scene reconstruction. Specifically, our approach considers the content-complexity variation across different regions and balances computational load during scene partitioning, enabling efficient scene reconstruction. To tackle the supervision mismatch issue during independent block optimization, we introduce auxiliary points during individual block optimization to align the ground-truth supervision, which enhances the reconstruction quality. Furthermore, we propose a pseudo-view geometry constraint that effectively mitigates rendering degradation caused by airspace floaters during block merging. Extensive experiments on large-scale scenes demonstrate that our approach achieves state-of-the-art performance in both reconstruction efficiency and rendering quality, with a 5x speedup in optimization and an average PSNR improvement of 1.21 dB on multiple benchmarks. Notably, BlockGaussian significantly reduces computational requirements, enabling large-scale scene reconstruction on a single 24GB VRAM device. The project page is available at https://github.com/SunshineWYC/BlockGaussian
中文:BlockGaussian通过内容感知的场景分割和可见性感知的块优化,实现了高效的大规模3D场景重建,在优化速度提升5倍的同时PSNR提高1.21分贝,并显著降低了硬件需求。
English: BlockGaussian introduces a content-aware partitioning and visibility-aware optimization framework that achieves efficient large-scale 3D scene reconstruction with 5x faster optimization and 1.21 dB PSNR improvement while reducing hardware requirements.

Authors:Shubham Aggarwal, Dipankar Maity, Tamer Başar
Title: InterQ: A DQN Framework for Optimal Intermittent Control
Abstract:
In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system's state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at https://github.com/AC-sh/InterQ.
中文摘要:本文通过强化学习探索了随机线性系统的通信与控制协同设计,提出的InterQ算法利用深度神经网络逼近Q函数,在平衡通信成本与控制性能方面优于周期性调度和事件触发两种基准策略。
English Summary: This letter presents a reinforcement learning-based co-design of communication and control for stochastic linear systems, introducing the InterQ algorithm that outperforms baseline scheduling policies by optimizing transmission intervals to balance communication costs and control performance.
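The scheduler-controller loop and the two baseline policies named in the abstract can be illustrated with a toy scalar simulation. This is only a sketch under stated assumptions: the system parameters `a` and `b`, the feedback gain `k`, the quadratic cost, and all function names are hypothetical choices for this example, and the learned InterQ policy itself is not reproduced here.

```python
import random


def simulate(policy, a=1.1, b=1.0, k=0.6, steps=50, noise_std=0.5, seed=0):
    """Run x[t+1] = a*x[t] + b*u[t] + w[t] with intermittent state transmission.

    `policy(t, x, x_hat)` returns True when the scheduler transmits the state.
    Returns (number of transmissions, average quadratic cost).
    """
    rng = random.Random(seed)
    x, x_hat = 1.0, 0.0              # true state and the controller's estimate
    transmissions, cost = 0, 0.0
    for t in range(steps):
        if policy(t, x, x_hat):
            x_hat = x                # controller receives the exact state
            transmissions += 1
        u = -k * x_hat               # certainty-equivalent control on the estimate
        cost += x * x + u * u
        w = rng.gauss(0.0, noise_std)
        x = a * x + b * u + w        # true dynamics
        x_hat = a * x_hat + b * u    # controller's open-loop prediction
    return transmissions, cost / steps


def periodic(period):
    """Baseline (a): transmit every `period` steps."""
    return lambda t, x, x_hat: t % period == 0


def event_triggered(threshold):
    """Baseline (b): transmit when the estimation error exceeds `threshold`."""
    return lambda t, x, x_hat: abs(x - x_hat) > threshold
```

Comparing `simulate(periodic(5))` with `simulate(event_triggered(1.0))` shows the trade-off the abstract describes: fewer transmissions lower the communication cost but let the estimate drift, which raises the control cost.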

Authors:Jiawei Li
Title: Detecting Instruction Fine-tuning Attack on Language Models with Influence Function
Abstract:
Instruction fine-tuning attacks pose a significant threat to large language models (LLMs) by subtly embedding poisoned data in fine-tuning datasets, which can trigger harmful or unintended responses across a range of tasks. This undermines model alignment and poses security risks in real-world deployment. In this work, we present a simple and effective approach to detect and mitigate such attacks using influence functions, a classical statistical tool adapted for machine learning interpretation. Traditionally, the high computational costs of influence functions have limited their application to large models and datasets. The recent Eigenvalue-Corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation method enables efficient influence score computation, making it feasible for large-scale analysis. We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets, as both the instruction fine-tuning attack on language models and the influence calculation approximation technique are relatively new. Our large-scale empirical evaluation of influence functions on 50,000 fine-tuning examples and 32 tasks reveals a strong association between influence scores and sentiment. Building on this, we introduce a novel sentiment transformation combined with influence functions to detect and remove critical poisons -- poisoned data points that skew model predictions. Removing these poisons (only 1% of total data) recovers model performance to near-clean levels, demonstrating the effectiveness and efficiency of our approach. Artifact is available at https://github.com/lijiawei20161002/Poison-Detection. WARNING: This paper contains offensive data examples.
中文: 本文利用影响函数(借助EK-FAC近似实现高效计算)检测并缓解针对大型语言模型的指令微调攻击,并结合情感变换识别关键毒化样本;移除这些仅占数据总量1%的样本即可将模型性能恢复至接近干净水平。
English: This paper uses influence functions, made tractable by the EK-FAC approximation, to detect and mitigate instruction fine-tuning attacks on LLMs; combining them with a sentiment transformation identifies critical poisons, and removing these (only 1% of the total data) restores model performance to near-clean levels.

Authors:Jiawei Li
Title: Detecting Instruction Fine-tuning Attacks on Language Models using Influence Function
Abstract:
Instruction finetuning attacks pose a serious threat to large language models (LLMs) by subtly embedding poisoned examples in finetuning datasets, leading to harmful or unintended behaviors in downstream applications. Detecting such attacks is challenging because poisoned data is often indistinguishable from clean data and prior knowledge of triggers or attack strategies is rarely available. We present a detection method that requires no prior knowledge of the attack. Our approach leverages influence functions under semantic transformation: by comparing influence distributions before and after a sentiment inversion, we identify critical poison examples whose influence is strong and remains unchanged before and after inversion. We show that this method works on a sentiment classification task and a math reasoning task, for different language models. Removing a small set of critical poisons (about 1% of the data) restores the model performance to near-clean levels. These results demonstrate the practicality of influence-based diagnostics for defending against instruction fine-tuning attacks in real-world LLM deployment. Artifact available at https://github.com/lijiawei20161002/Poison-Detection. WARNING: This paper contains offensive data examples.
中文: 本文提出了一种针对大型语言模型指令微调攻击的新型检测方法,该方法利用语义变换下的影响函数来识别关键毒化样本,无需攻击先验知识,仅需移除约占数据1%的关键毒化样本即可将模型性能恢复至接近正常水平。
English: This paper introduces a novel detection method for instruction finetuning attacks on LLMs that uses influence functions under semantic transformation to identify critical poison examples without prior knowledge of the attack, restoring model performance to near-clean levels by removing a small set of critical poisons (about 1% of the data).

Authors:Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang
Title: MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications
Abstract:
Modern cutting-edge AI applications are being developed over fast-evolving, heterogeneous, nascent hardware devices. This requires frequent reworking of the AI software stack to adopt bottom-up changes from new hardware, which takes time for general-purpose software libraries. Consequently, real applications often develop custom software stacks optimized for their specific workloads and hardware. Custom stacks help in quick development and optimization, but incur a lot of redundant effort across applications in writing non-portable code. This paper discusses an alternative communication library interface for AI applications that offers both portability and performance by reducing redundant efforts while maintaining flexibility for customization. We present MSCCL++, a novel abstraction of GPU communication based on separation of concerns: (1) a primitive interface provides a minimal hardware abstraction as a common ground for software and hardware developers to write custom communication, and (2) higher-level portable interfaces and specialized implementations enable optimization for different workloads and hardware environments. This approach makes the primitive interface reusable across applications while enabling highly flexible optimization. Compared to state-of-the-art baselines (NCCL, RCCL, and MSCCL), MSCCL++ achieves speedups of up to 5.4$\times$ for collective communication and up to 15% for real-world AI inference workloads. MSCCL++ is in production use across multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open-source and available at https://github.com/microsoft/mscclpp.
中文:MSCCL++提出了一种创新的GPU通信库接口,通过分离硬件抽象与优化层级,在减少跨平台重复编码工作的同时,为AI应用提供兼具可移植性与高性能的通信解决方案。
English: MSCCL++ introduces a novel GPU communication library interface that separates hardware abstraction from optimization layers, enabling portable high-performance AI applications while reducing redundant coding efforts across platforms.

Authors:Han Liao, Shuaishuai Zu
Title: RouterKT: Mixture-of-Experts for Knowledge Tracing
Abstract:
Knowledge Tracing (KT) is a fundamental task in Intelligent Tutoring Systems (ITS), which aims to model the dynamic knowledge states of students based on their interaction histories. However, existing KT models often rely on a global forgetting decay mechanism for capturing learning patterns, assuming that students' performance is predominantly influenced by their most recent interactions. Such approaches fail to account for the diverse and complex learning patterns arising from individual differences and varying learning stages. To address this limitation, we propose RouterKT, a novel Mixture-of-Experts (MoE) architecture designed to capture heterogeneous learning patterns by enabling experts to specialize in different patterns without any handcrafted learning pattern bias such as forgetting decay. Specifically, RouterKT introduces a \textbf{person-wise routing mechanism} to effectively model individual-specific learning behaviors and employs \textbf{multi-heads as experts} to enhance the modeling of complex and diverse patterns. Comprehensive experiments on ten benchmark datasets demonstrate that RouterKT exhibits significant flexibility and improves the performance of various KT backbone models, with a maximum average AUC improvement of 3.29\% across different backbones and datasets, outperforming other state-of-the-art models. Moreover, RouterKT demonstrates consistently superior inference efficiency compared to existing approaches based on handcrafted learning pattern bias, highlighting its usability for real-world educational applications. The source code is available at https://github.com/ringotc/RouterKT.git.
中文: RouterKT通过引入专家混合架构和个性化路由机制,无需依赖全局遗忘衰减假设即可捕捉多样化学习模式,在多个基准测试中显著提升了知识追踪的性能与推理效率。
English: RouterKT introduces a Mixture-of-Experts architecture with person-wise routing to capture diverse learning patterns without relying on global forgetting mechanisms, significantly improving knowledge tracing performance and efficiency across multiple benchmarks.

Authors:Xijin Ge
Title: DataMap: A Portable Application for Visualizing High-Dimensional Data
Abstract:
Motivation: The visualization and analysis of high-dimensional data are essential in biomedical research. There is a need for secure, scalable, and reproducible tools to facilitate data exploration and interpretation. Results: We introduce DataMap, a browser-based application for visualization of high-dimensional data using heatmaps, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). DataMap runs in the web browser, ensuring data privacy while eliminating the need for installation or a server. The application has an intuitive user interface for data transformation, annotation, and generation of reproducible R code. Availability and Implementation: Freely available as a GitHub page https://gexijin.github.io/datamap/. The source code can be found at https://github.com/gexijin/datamap, and can also be installed as an R package. Contact: Xijin.Ge@sdstate.ed
中文:DataMap是一款基于浏览器的安全工具,可通过热图、PCA和t-SNE可视化高维生物医学数据,无需安装即可保障数据隐私并生成可复现的R代码。
English: DataMap is a secure, browser-based tool for visualizing high-dimensional biomedical data through heatmaps, PCA, and t-SNE, offering data privacy and reproducible R code without installation.

Authors:Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang
Title: Mimic In-Context Learning for Multimodal Tasks
Abstract:
Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.
Chinese: 摘要介绍了模仿上下文学习(MimIC)方法,该方法通过四项关键改进从上下文示例中学习稳定的偏移效应,以增强大型多模态模型,在多模态任务中表现优于现有方法。
English: The abstract introduces Mimic In-Context Learning (MimIC), a method that enhances Large Multimodal Models by learning stable shift effects from in-context demonstrations through four key improvements, outperforming existing approaches in multimodal tasks.

Authors:Vasiliki Tassopoulou, Haochang Shou, Christos Davatzikos
Title: Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories
Abstract:
Longitudinal biomedical studies monitor individuals over time to capture dynamics in brain development, disease progression, and treatment effects. However, estimating trajectories of brain biomarkers is challenging due to biological variability, inconsistencies in measurement protocols (e.g., differences in MRI scanners), scarcity, and irregularity in longitudinal measurements. Herein, we introduce a novel personalized deep kernel regression framework for forecasting brain biomarkers, with application to regional volumetric measurements. Our approach integrates two key components: a population model that captures brain trajectories from a large and diverse cohort, and a subject-specific model that captures individual trajectories. To optimally combine these, we propose Adaptive Shrinkage Estimation, which effectively balances population and subject-specific models. We assess our model's performance through predictive accuracy metrics, uncertainty quantification, and validation against external clinical studies. Benchmarking against state-of-the-art statistical and machine learning models -- including linear mixed effects models, generalized additive models, and deep learning methods -- demonstrates the superior predictive performance of our approach. Additionally, we apply our method to predict trajectories of composite neuroimaging biomarkers, which highlights the versatility of our approach in modeling the progression of longitudinal neuroimaging biomarkers. Furthermore, validation on three external neuroimaging studies confirms the robustness of our method across different clinical contexts. We make the code available at https://github.com/vatass/AdaptiveShrinkageDKGP.
中文: 本研究提出了一种个性化深度核回归框架,通过自适应收缩估计整合群体与个体模型,能够精确预测脑生物标志物,在预测精度和跨研究验证方面均优于现有方法。
English: This study introduces a personalized deep kernel regression framework that combines population and individual models through Adaptive Shrinkage Estimation to accurately forecast brain biomarkers, demonstrating superior performance over existing methods in predictive accuracy and cross-study validation.
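The idea of balancing a population prediction against a subject-specific one can be illustrated with a simple shrinkage combination. The abstract does not specify how the shrinkage weight is obtained, so the sketch below substitutes classical inverse-variance (precision) weighting as a stand-in; the function name, its arguments, and the weighting rule are assumptions for illustration, not the paper's estimator.

```python
def shrinkage_combine(pop_mean, pop_var, subj_mean, subj_var):
    """Blend population and subject-specific predictions of one biomarker value.

    Returns (alpha, combined_mean, combined_var), where alpha is the weight
    placed on the population prediction. Inverse-variance weighting is an
    assumed stand-in for the paper's adaptive shrinkage estimator.
    """
    if pop_var <= 0 or subj_var <= 0:
        raise ValueError("variances must be positive")
    # The noisier the subject-specific model, the more we shrink to the population.
    alpha = subj_var / (pop_var + subj_var)
    combined_mean = alpha * pop_mean + (1.0 - alpha) * subj_mean
    combined_var = (pop_var * subj_var) / (pop_var + subj_var)
    return alpha, combined_mean, combined_var
```

With few observations for a subject, `subj_var` is large, so `alpha` approaches 1 and the forecast leans on the population model; as more visits accumulate and `subj_var` falls, the forecast moves toward the individual trajectory.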

Authors:Zirui Chen, Zhaoyang Zhang, Ziqing Xing, Ridong Li, Zhaohui Yang, Richeng Jin, Chongwen Huang, Yuzhi Yang, Mérouane Debbah
Title: Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization
Abstract:
Existing learning models often exhibit poor generalization when deployed across diverse scenarios. This is primarily because the underlying reference frame of the data varies with the deployment environment and settings. However, although the data of each scenario have a distinct reference frame, their generation generally follows common underlying physical rules. Based on this understanding, this article proposes a deep learning framework named analogical learning (AL), which implicitly retrieves the reference frame information associated with a scenario and then makes accurate predictions by relative analogy with other scenarios. Specifically, we design a bipartite neural network called Mateformer. Its first part captures the relativity within multiple latent feature spaces between the input data and a small amount of embedded data from the studied scenario, while its second part uses this relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments validate AL's superiority across three key dimensions. First, it achieves state-of-the-art accuracy in single-scenario benchmarks. Second, it demonstrates stable transferability between different scenarios, avoiding catastrophic forgetting. Finally, and most importantly, it robustly adapts to new, unseen scenarios, including dynamic weather and traffic conditions, without any tuning. All data and code are available at https://github.com/ziruichen-research/ALLoc.
中文:该文提出的类比学习框架通过Mateformer神经网络实现,能隐式适应特定场景的参考框架并利用共享物理规律,从而在无需重新训练的情况下,于动态环境中实现卓越的准确性、可迁移性和鲁棒性。
English: The proposed analogical learning framework, implemented via the Mateformer neural network, enhances generalization by implicitly adapting to scenario-specific reference frames while leveraging shared physical principles, achieving superior accuracy, transferability, and robustness in dynamic environments without retraining.

Authors:Zheyuan Lai, Yingming Pu
Title: PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration
Abstract:
The complex chemical space and the limited, biased scope of existing knowledge pose immense challenges for human scientists and, likewise, for automated materials discovery. Existing intelligent methods rely more on numerical computation, leading to inefficient exploration and results that are hard to interpret. To bridge this gap, we introduce a principles-guided material discovery system powered by a language-inferential multi-agent system (MAS), namely PriM. Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, a case study on nano helix demonstrates a higher materials exploration rate and property value while providing transparent reasoning pathways. This approach develops an automated-and-transparent paradigm for material discovery, with broad implications for rational design of functional materials. Code is publicly available at https://github.com/amair-lab/PriM.
中文摘要:PriM系统采用基于语言的多智能体方法,通过原则引导与实验验证相结合,实现了材料发现的高效探索与透明推理。
English Summary: The PriM system uses a language-based multi-agent approach to automate material discovery, enhancing exploration efficiency and interpretability through principled guidance and experimental validation.

Authors:Zhengke Sun, Hangwei Qian, Ivor Tsang
Title: Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models
Abstract:
Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at https://github.com/zachysun/TS-Lang-Exp.
中文摘要:本研究质疑文本数据在时间序列大语言模型中的有效性,发现由于模态不匹配,文本信息通常无法提升预测性能且缺乏可解释性。
English Summary: This study questions the effectiveness of text integration in time series forecasting with LLMs, finding that textual data often fails to improve performance or provide clear interpretability due to modality misalignment.

Authors:Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
Title: PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Abstract:
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
中文摘要:本文提出prima.cpp分布式推理系统,通过优化CPU/GPU资源分配和内存管理,实现在普通家用设备上高效运行70B参数大语言模型,同时保持低内存占用。
English Summary: The paper introduces prima.cpp, a distributed inference system that enables efficient running of large 70B-scale language models on standard home devices by optimizing resource allocation across CPUs and GPUs while maintaining low memory usage.

Authors:Lucas Beerens, Desmond J. Higham
Title: Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models
Abstract:
We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at https://github.com/LucasBeerens/CRAFTed-Diffusion .
中文摘要:本研究提出一种通过微调将对抗性功能嵌入扩散模型的隐蔽攻击方法,使模型在生成看似正常图像的同时能有效误导下游分类器,且用户难以察觉其恶意性质。
English Summary: This study introduces a stealthy attack method that embeds adversarial capabilities into diffusion models through fine-tuning, enabling them to generate seemingly normal images that reliably mislead downstream classifiers while remaining undetectable to users.

Authors:Yuxuan Chen, Dewen Guo, Sen Mei, Xinze Li, Hao Chen, Yishan Li, Yixuan Wang, Chaoyue Tang, Ruobing Wang, Dingjun Wu, Yukun Yan, Zhenghao Liu, Shi Yu, Zhiyuan Liu, Maosong Sun
Title: UltraRAG: A Modular and Automated Toolkit for Adaptive Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) significantly enhances the performance of large language models (LLMs) in downstream tasks by integrating external knowledge. To facilitate researchers in deploying RAG systems, various RAG toolkits have been introduced. However, many existing RAG toolkits lack support for knowledge adaptation tailored to specific application scenarios. To address this limitation, we propose UltraRAG, a RAG toolkit that automates knowledge adaptation throughout the entire workflow, from data construction and training to evaluation, while ensuring ease of use. UltraRAG features a user-friendly WebUI that streamlines the RAG process, allowing users to build and optimize systems without coding expertise. It supports multimodal input and provides comprehensive tools for managing the knowledge base. With its highly modular architecture, UltraRAG delivers an end-to-end development solution, enabling seamless knowledge adaptation across diverse user scenarios. The code, demonstration videos, and installable package for UltraRAG are publicly available at https://github.com/OpenBMB/UltraRAG.
Chinese: UltraRAG是一种新型工具包,通过自动化知识适配和提供用户友好的Web界面,无需编程即可实现跨场景的RAG系统端到端开发。
English: UltraRAG is a novel toolkit that automates knowledge adaptation across the entire workflow, offering a user-friendly WebUI and modular architecture for seamless RAG system development without coding.

Authors:Yang Yang, Tong Zhang, Jian Wu, Lijie Su
Title: Dynamic Topic Analysis in Academic Journals using Convex Non-negative Matrix Factorization Method
Abstract:
With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI's understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates convex optimization to improve topic consistency, sparsity, and interpretability. In Stage 1, a two-layer non-negative matrix factorization (NMF) model is employed to extract annual topics and identify key terms. In Stage 2, a convex optimization algorithm refines the dynamic topic structure using the convex NMF (cNMF) model, further enhancing topic integration and stability. Applying the proposed method to IEEE journal abstracts from 2004 to 2022 effectively identifies and quantifies emerging research topics, such as COVID-19 and digital twins. By optimizing sparsity differences in the clustering feature space between traditional and emerging research topics, the framework provides deeper insights into topic evolution and ranking analysis. Moreover, the NMF-cNMF model demonstrates superior stability in topic consistency. At sparsity levels of 0.4, 0.6, and 0.9, the proposed approach improves topic ranking stability by 24.51%, 56.60%, and 36.93%, respectively. The source code (to be open after publication) is available at https://github.com/meetyangyang/CDNMF.
中文: 本文提出了一种采用凸优化的两阶段动态主题分析框架,通过非负矩阵分解和凸非负矩阵分解模型提升主题一致性与稳定性,有效识别了2004至2022年IEEE摘要中COVID-19和数字孪生等新兴研究主题。
English: This paper introduces a two-stage dynamic topic analysis framework using convex optimization to enhance topic consistency and stability, effectively identifying emerging research trends like COVID-19 and digital twins in IEEE abstracts from 2004 to 2022.

Authors:Anton Thielmann, Arik Reuter, Benjamin Saefken
Title: Beyond Black-Box Predictions: Identifying Marginal Feature Effects in Tabular Transformer Networks
Abstract:
In recent years, deep neural networks have showcased their predictive power across a variety of tasks. Beyond natural language processing, the transformer architecture has proven efficient in addressing tabular data problems and challenges the previously dominant gradient-based decision trees in these areas. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model's predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is available at https://github.com/OpenTabular/NAMpy.
Chinese: 所提出的表格Transformer网络改进方案通过准确识别边际特征效应,在保持与黑盒模型相当预测性能的同时,弥合了可解释性与预测能力之间的鸿沟。
English: The proposed adaptation of tabular transformer networks bridges the gap between predictive performance and intelligibility by accurately identifying marginal feature effects while matching black-box model capabilities.

Authors:Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, Laurent Callot
Title: SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
Abstract:
Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. We provide a task and repository-stratified subsample (SWE-PolyBench500) and release an evaluation harness allowing for fully automated evaluation. To enable a more comprehensive comparison of coding agents, this work also presents a novel set of metrics rooted in syntax tree analysis. We evaluate leading open source coding agents on SWE-PolyBench, revealing their strengths and limitations across languages, task types, and complexity classes. Our experiments show that current agents exhibit uneven performances across languages and struggle with complex problems while showing higher performance on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering. Our datasets and code are available at: https://github.com/amazon-science/SWE-PolyBench
中文摘要:SWE-PolyBench是一个新的多语言基准测试平台,通过基于执行的评估和创新的语法树指标,揭示了当前编程智能体在不同编程语言和任务难度下表现不均的问题。
English Summary: SWE-PolyBench is a new multi-language benchmark for automated evaluation of coding agents, revealing their uneven performance across programming languages and difficulty levels through execution-based testing and novel syntax tree metrics.

Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Title: Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Abstract:
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without any external auxiliary signal, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
中文总结:Genius是一种无监督自训练框架,通过逐步前瞻重采样和优势校准优化技术,无需外部监督即可提升大语言模型的推理能力,有效解决了扩展性和标注成本问题。
English Summary: Genius is an unsupervised self-training framework that enhances LLM reasoning through stepwise foresight re-sampling and advantage-calibrated optimization, eliminating the need for external supervision while improving scalability.

Authors:Ian Noronha, Advait Prasad Jawaji, Juan Camilo Soto, Jiajun An, Yan Gu, Upinder Kaur
Title: MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction
Abstract:
Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.
中文: MBE-ARI数据集通过提供奶牛与机器人交互的多模态同步数据及高精度四足动物姿态估计模型,填补了动物-机器人交互研究领域的资源空白,为推进双向通信研究奠定了坚实基础。
English: The MBE-ARI dataset addresses the lack of resources in animal-robot interaction by providing synchronized multimodal data of cow-robot interactions and a high-precision quadruped pose estimation model, establishing a foundation for advancing bidirectional communication research.

Authors:Gabriele Lozupone, Alessandro Bria, Francesco Fontanella, Frederick J. A. Meijer, Claudio De Stefano, Henkjan Huisman
Title: Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging
Abstract:
This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at https://github.com/GabrieleLozupone/LDAE
中文: 本研究提出的潜在扩散自编码器在压缩潜在空间应用扩散过程,实现了高效的3D医学影像表征学习,不仅对阿尔茨海默病展现出优异诊断性能,还具备高质量图像重建能力和显著提升的计算效率。
English: This study introduces the Latent Diffusion Autoencoder (LDAE), a novel framework that applies diffusion processes in compressed latent space to achieve efficient 3D medical imaging representation learning, demonstrating strong diagnostic performance for Alzheimer's disease and high-quality image reconstruction with significantly improved computational efficiency.

Authors:Renu Sharma, Debasmita Pal, Arun Ross
Title: Task-conditioned Ensemble of Expert Models for Continuous Learning
Abstract:
One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models, which are based on the local outlier concept and are distinct from the expert models, provide task membership information dynamically at run-time to each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.
中文: 本研究针对非稳态环境中模型精度下降的问题,提出了一种基于任务条件的集成方法,通过局部离群值概念动态组合专家模型,使模型在适应新数据的同时保持对原有数据的性能。
English: This work addresses the challenge of maintaining model accuracy in non-stationary environments by proposing a task-conditioned ensemble method that dynamically combines expert models using local outlier concepts to adapt to new data while preserving performance on old data.
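
The routing idea in this abstract (in-domain models supply per-sample task membership, which then conditions an ensemble of experts) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the mean k-nearest-neighbour distance is a simple stand-in for a local-outlier score, the softmax-style weighting is one possible way to turn scores into memberships, and the names `knn_distance`, `task_membership`, and `ensemble_predict` are hypothetical and not taken from the authors' repository.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def knn_distance(x: Vector, reference: List[Vector], k: int = 3) -> float:
    """Mean distance from x to its k nearest reference points.

    Stand-in for a local-outlier score: larger means x looks less
    like the reference (in-domain) data of that task.
    """
    dists = sorted(math.dist(x, r) for r in reference)
    k = min(k, len(dists))
    return sum(dists[:k]) / k


def task_membership(x: Vector, task_data: List[List[Vector]], k: int = 3) -> List[float]:
    """Turn per-task outlier scores into membership weights that sum to 1."""
    scores = [knn_distance(x, ref, k) for ref in task_data]
    weights = [math.exp(-s) for s in scores]  # low outlier score -> high weight
    total = sum(weights)
    return [w / total for w in weights]


def ensemble_predict(
    x: Vector,
    experts: List[Callable[[Vector], float]],
    task_data: List[List[Vector]],
    k: int = 3,
) -> float:
    """Combine expert outputs, weighted by the probe's task membership."""
    weights = task_membership(x, task_data, k)
    return sum(w * expert(x) for w, expert in zip(weights, experts))


# Two toy tasks with well-separated training data (hypothetical values).
task_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
task_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
experts = [lambda x: 0.0, lambda x: 1.0]  # expert A outputs 0, expert B outputs 1

pred_near_a = ensemble_predict((0.2, 0.2), experts, [task_a, task_b])
pred_near_b = ensemble_predict((10.5, 10.5), experts, [task_a, task_b])
```

A probe close to task A's data is routed almost entirely to expert A, and likewise for task B, so old-task behaviour is preserved when a new expert is added.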

Authors:Matteo Spanio, Antonio Rodà
Title: TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration
Abstract:
The burgeoning complexity and real-time processing demands of audio signals necessitate optimized algorithms that harness the computational prowess of Graphics Processing Units (GPUs). Existing Digital Signal Processing (DSP) libraries often fall short in delivering the requisite efficiency and flexibility, particularly in integrating Artificial Intelligence (AI) models. In response, we introduce TorchFX: a GPU-accelerated Python library for DSP, specifically engineered to facilitate sophisticated audio signal processing. Built atop the PyTorch framework, TorchFX offers an Object-Oriented interface that emulates the usability of torchaudio, enhancing functionality with a novel pipe operator for intuitive filter chaining. This library provides a comprehensive suite of Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, with a focus on multichannel audio files, thus facilitating the integration of DSP and AI-based approaches. Our benchmarking results demonstrate significant efficiency gains over traditional libraries like SciPy, particularly in multichannel contexts. Despite current limitations in GPU compatibility, ongoing developments promise broader support and real-time processing capabilities. TorchFX aims to become a useful tool for the community, contributing to innovation and progress in DSP with GPU acceleration. TorchFX is publicly available on GitHub at https://github.com/matteospanio/torchfx.
中文摘要:TorchFX是基于PyTorch的GPU加速Python音频处理库,通过面向对象设计和创新的管道运算符实现高效滤波器链式操作,在多项测试中显著超越SciPy等传统库的性能表现。
English Summary: TorchFX is a GPU-accelerated Python library built on PyTorch that provides efficient audio signal processing with object-oriented design and enhanced filter chaining capabilities, demonstrating superior performance over traditional libraries like SciPy.
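
To make the "pipe operator for filter chaining" idea concrete, here is a minimal pure-Python sketch of how `|` can be overloaded to compose processing stages. It does not use or reproduce the TorchFX API: the `Stage` class and the `gain` and `fir` helpers are hypothetical names introduced only for illustration, and signals are plain lists of floats rather than tensors.

```python
from typing import Callable, List, Sequence

Signal = List[float]


class Stage:
    """A processing stage that can be chained with the | operator."""

    def __init__(self, fn: Callable[[Signal], Signal], name: str = "stage") -> None:
        self.fn = fn
        self.name = name

    def __call__(self, x: Sequence[float]) -> Signal:
        return self.fn(list(x))

    def __or__(self, other: "Stage") -> "Stage":
        # stage_a | stage_b builds a new stage that runs a, then b.
        return Stage(lambda x: other(self(x)), f"{self.name} | {other.name}")

    def __ror__(self, x: Sequence[float]) -> Signal:
        # samples | stage applies the stage to a plain sequence of samples.
        return self(x)


def gain(g: float) -> Stage:
    """Multiply every sample by a constant factor."""
    return Stage(lambda x: [g * s for s in x], f"gain({g})")


def fir(taps: Sequence[float]) -> Stage:
    """Causal FIR filter y[n] = sum_k taps[k] * x[n-k], with zero initial state."""

    def apply(x: Signal) -> Signal:
        return [
            sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(x))
        ]

    return Stage(apply, f"fir({len(taps)} taps)")


chain = gain(2.0) | fir([0.5, 0.5])      # compose two stages
out = [1.0, 1.0, 1.0, 1.0] | chain       # push a signal through the chain
```

With these inputs, the gain stage doubles each sample and the two-tap averaging filter then yields `[1.0, 2.0, 2.0, 2.0]`, the first output being lower because the filter starts from a zero state.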

Authors:Tao Zhang, Zhenhai Liu, Yong Xin, Yongjun Jiao
Title: MooseAgent: A LLM Based Multi-agent Framework for Automating Moose Simulation
Abstract:
The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre-processing, solver configuration, and post-processing stages are often time-consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi-physics simulation framework MOOSE, which combines large-scale pre-trained language models (LLMs) with a multi-agent system. The framework uses LLMs to understand user-described simulation requirements in natural language and employs task decomposition and multi-round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi-physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single-physics problems. The main contribution of this research is the proposal of a multi-agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open-sourced and is available at https://github.com/taozhan18/MooseAgent
中文: 本文提出MooseAgent自动化框架,通过结合大语言模型与多智能体系统,能够理解自然语言描述的仿真需求并自动生成MOOSE输入文件,有效降低了有限元模拟的使用门槛。
English: This paper introduces MooseAgent, an automated framework for the MOOSE multi-physics simulation platform that leverages large language models and multi-agent systems to interpret natural language requirements and generate simulation input files, demonstrating effectiveness in simplifying finite element analysis workflows.

Authors:Gesina Schwalbe, Georgii Mikriukov, Edgar Heinert, Stavros Gerolymatos, Mert Keser, Alois Knoll, Matthias Rottmann, Annika Mütze
Title: On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs
Abstract:
The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data-driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: To what extent are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds? For example, wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might exploit this shortcut for enhanced performance. A dangerous performance degradation on concept corner cases, such as animals on the road, could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on >50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.
Chinese: 基于概念的可解释人工智能方法常从训练数据中无意识地学习背景偏差,导致物体出现在非典型背景场景时性能不可靠,这一点已通过跨多个数据集和神经网络架构的系统性测试得到验证。
English: Concept-based explainable AI methods often unintentionally learn background biases from training data, leading to unreliable performance in scenarios where objects appear against atypical backgrounds, as demonstrated through systematic testing across multiple datasets and neural network architectures.

Authors:Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, Jian Guo
Title: SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Abstract:
Natural Language to SQL (NL2SQL) enables intuitive interactions with databases by transforming natural language queries into structured SQL statements. Despite recent advancements in enhancing human-computer interaction within database applications, significant challenges persist, particularly regarding the reasoning performance in complex scenarios involving multi-table joins and nested queries. Current methodologies primarily utilize supervised fine-tuning (SFT) to train the NL2SQL model, which may limit adaptability and interpretability in new environments (e.g., finance and healthcare). In order to enhance the reasoning performance of the NL2SQL model in the above complex situations, we introduce SQL-R1, a novel NL2SQL reasoning model trained by reinforcement learning (RL) algorithms. We design a specialized RL-based reward function tailored for NL2SQL tasks and discuss the impact of cold start and synthetic data on the effectiveness of intensive training. In addition, we achieve competitive accuracy using only a tiny amount of synthetic NL2SQL data for augmented training and further explore data engineering for RL. In existing experiments, SQL-R1 achieves execution accuracy of 88.6% and 67.1% on the Spider and BIRD benchmarks, respectively. The code is available at https://github.com/IDEA-FinAI/SQL-R1 .
Chinese: SQL-R1是一种基于强化学习训练的新型NL2SQL模型,旨在提升多表连接和嵌套查询等复杂场景下的推理性能,仅用少量合成数据即在基准测试中取得了优异准确率。
English: SQL-R1 is a novel NL2SQL model trained with reinforcement learning to improve reasoning performance in complex scenarios like multi-table joins and nested queries, achieving competitive accuracy on benchmarks with minimal synthetic data.
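
The abstract mentions a reward function tailored for NL2SQL but does not spell it out. As a rough illustration of what an execution-based reward can look like, the sketch below scores a predicted query by running it against a database and comparing its rows with those of a gold query. The three reward levels (-1.0, 0.0, 1.0), the order-insensitive comparison, and the names `run_query` and `nl2sql_reward` are assumptions made for this example; they are not the reward design used in SQL-R1.

```python
import sqlite3
from typing import List, Optional, Tuple


def run_query(conn: sqlite3.Connection, sql: str) -> Optional[List[Tuple]]:
    """Execute sql and return its rows, or None if it cannot be executed."""
    try:
        return conn.execute(sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return None


def nl2sql_reward(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> float:
    """Hypothetical execution-based reward for one predicted query.

    -1.0: predicted query fails to execute
     0.0: executes, but rows differ from the gold query's rows
     1.0: executes and returns the same rows (ignoring row order)
    A real setup would run predictions on a read-only copy of the
    database so that a generated statement cannot modify the data.
    """
    gold_rows = run_query(conn, gold_sql)
    if gold_rows is None:
        raise ValueError("gold query must be executable")
    pred_rows = run_query(conn, predicted_sql)
    if pred_rows is None:
        return -1.0
    if sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr):
        return 1.0
    return 0.0


# Toy in-memory database (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Ann", "eng", 120), ("Bob", "eng", 100), ("Cy", "ops", 90)],
)

gold = "SELECT name FROM emp WHERE salary > 95"
r_match = nl2sql_reward(conn, "SELECT name FROM emp WHERE dept = 'eng'", gold)
r_wrong = nl2sql_reward(conn, "SELECT name FROM emp", gold)
r_broken = nl2sql_reward(conn, "SELEC name FROM emp", gold)
```

Comparing result sets rather than SQL strings rewards semantically equivalent queries that are written differently, which is the property that execution accuracy on benchmarks such as Spider and BIRD measures.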

Authors:Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Title: Playpen: An Environment for Exploring Learning Through Conversational Interaction
Abstract:
Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model's response. In this paper, we investigate whether Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with GRPO. We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction.
中文: 本研究探讨将对话游戏作为大语言模型后训练的反馈来源,通过自博弈学习环境Playpen发现交互式强化学习(GRPO)能在保持各项技能的同时实现均衡的性能提升。
English: This study explores using Dialogue Games as a feedback source for post-training LLMs, introducing Playpen for self-play learning and finding that interactive reinforcement learning (GRPO) achieves balanced skill improvements without degradation.

Authors:Vassili Korotkine, Mitchell Cohen, James Richard Forbes
Title: Globally Optimal Data-Association-Free Landmark-Based Localization Using Semidefinite Relaxations
Abstract:
This paper proposes a semidefinite relaxation for landmark-based localization with unknown data associations in planar environments. The proposed method simultaneously solves for the optimal robot states and data associations in a globally optimal fashion. Relative position measurements to known landmarks are used, but the data association is unknown in that the robot does not know which landmark each measurement is generated from. The relaxation is shown to be tight in a majority of cases for moderate noise levels. The proposed algorithm is compared to local Gauss-Newton baselines initialized at the dead-reckoned trajectory, and is shown to significantly improve convergence to the problem's global optimum in simulation and experiment. Accompanying software and supplementary material may be found at https://github.com/decargroup/certifiable_uda_loc .
中文: 本文提出了一种半定松弛方法,用于在未知数据关联的情况下实现全局最优的基于地标的定位,该方法在中等噪声水平下具有紧密性,并在仿真和实验中显著提升了全局最优解的收敛性能。
English: This paper introduces a semidefinite relaxation method for globally optimal landmark-based localization with unknown data associations, demonstrating tightness under moderate noise and superior convergence to the global optimum compared to baseline methods.

Authors:Ye Ye
Title: Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks
Abstract:
Large Language Models (LLMs) are increasingly used as autonomous agents for multi-step tasks. However, most existing frameworks fail to maintain a structured understanding of the task state, often relying on linear prompt concatenation or shallow memory buffers. This leads to brittle performance, frequent hallucinations, and poor long-range coherence. In this work, we propose the Task Memory Engine (TME), a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). Each node in the tree corresponds to a task step, storing relevant input, output, status, and sub-task relationships. We introduce a prompt synthesis method that dynamically generates LLM prompts based on the active node path, significantly improving execution consistency and contextual grounding. Through case studies and comparative experiments on multi-step agent tasks, we demonstrate that TME leads to better task completion accuracy and more interpretable behavior with minimal implementation overhead. A reference implementation of the core TME components is available at https://github.com/biubiutomato/TME-Agent, including basic examples and structured memory integration. While the current implementation uses a tree-based structure, TME is designed to be graph-aware, supporting reusable substeps, converging task paths, and shared dependencies. This lays the groundwork for future DAG-based memory architectures.
Chinese: 本文提出任务记忆引擎(TME),通过分层任务记忆树结构追踪多步骤任务执行状态并动态生成提示,以最小实现成本显著提升大语言模型代理的任务完成准确性和可解释性。
English: This paper introduces the Task Memory Engine (TME), a structured memory module that uses a hierarchical tree to track multi-step task execution and dynamically generate prompts, improving LLM agent performance with minimal overhead.
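As a rough illustration of the Task Memory Tree idea described in this abstract, the sketch below stores one task step per node and builds a prompt from the path between the root and the active node. The class `TaskNode`, its field names, and the helpers `active_path` and `synthesize_prompt` are hypothetical names chosen for this example; they are not taken from the TME-Agent repository, and the prompt format is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskNode:
    """One task step; field names are illustrative assumptions."""
    name: str
    input: str = ""
    output: str = ""
    status: str = "pending"  # e.g. "pending", "done", "failed"
    # Exclude parent from repr/eq to avoid recursing through the tree.
    parent: Optional["TaskNode"] = field(default=None, repr=False, compare=False)
    children: List["TaskNode"] = field(default_factory=list)

    def add_child(self, child: "TaskNode") -> "TaskNode":
        """Attach `child` as a sub-task and return it."""
        child.parent = self
        self.children.append(child)
        return child


def active_path(node: TaskNode) -> List[TaskNode]:
    """Return the nodes from the root down to `node`."""
    path: List[TaskNode] = []
    current: Optional[TaskNode] = node
    while current is not None:
        path.append(current)
        current = current.parent
    return list(reversed(path))


def synthesize_prompt(node: TaskNode) -> str:
    """Build a prompt from the active node path only (hypothetical format)."""
    lines = [
        f"[{n.status}] {n.name}: {n.output or n.input}"
        for n in active_path(node)
    ]
    return "\n".join(lines)


root = TaskNode("plan trip", input="3 days in Rome", status="done", output="outline ready")
book = root.add_child(TaskNode("book hotel", input="near centre"))
root.add_child(TaskNode("buy tickets", input="museum passes"))

prompt = synthesize_prompt(book)
```

Because the prompt is built only from the path to the active node, the sibling step "buy tickets" does not appear in it, in contrast to concatenating the entire history linearly.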

Authors:Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Title: A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images
Abstract:
In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
中文: 本文提出了一种可解释的CNN-Transformer混合模型,用于视网膜疾病检测,该模型在实现最先进预测性能的同时,能生成局部化的证据图。
English: This paper introduces an interpretable hybrid CNN-Transformer model for retinal disease detection that generates localized evidence maps while achieving state-of-the-art predictive performance on color fundus images.

Authors:Yi Chen, Tianchen Deng, Wentao Zhao, Xiaoning Wang, Wenqian Xi, Weidong Chen, Jingchuan Wang
Title: SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis
Abstract:
Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on https://github.com/dtc111111/SN-Lidar.
中文:针对激光雷达点云新视角合成研究中常忽略语义标签的问题,本文提出SN-LiDAR方法,通过从粗到精的平面网格特征表征和基于CNN的局部特征提取,同步实现精确语义分割、高质量几何重建与逼真激光雷达合成,在多个基准数据集上展现出卓越性能。
English: Recent research on LiDAR point cloud novel view synthesis often neglects semantic labels, but the proposed SN-LiDAR method jointly achieves accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis through a coarse-to-fine feature representation and CNN-based local feature extraction, demonstrating superior performance on benchmark datasets.

Authors:Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, Xiongkuo Min
Title: LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Abstract:
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
Chinese: 针对大型多模态模型图像生成的评估难题,我们提出了包含大规模人工标注的综合基准EvalMi-50K,并开发了基于LMM的评估指标LMM4LMM,该指标在感知质量与图文对齐度评估中表现出卓越性能与泛化能力。
English: To address the limitations in evaluating large multimodal models' image generation, we introduce EvalMi-50K, a comprehensive benchmark with extensive human annotations, and propose LMM4LMM, an LMM-based metric that demonstrates superior performance and generalization in assessing perceptual quality and text-image alignment.

Authors:Lishuang Wang, Mengfei Zhao, Enyu Liu, Kebin Sun, Ran Cheng
Title: TensorNEAT: A GPU-accelerated Library for NeuroEvolution of Augmenting Topologies
Abstract:
The NeuroEvolution of Augmenting Topologies (NEAT) algorithm has received considerable recognition in the field of neuroevolution. Its effectiveness is derived from initiating with simple networks and incrementally evolving both their topologies and weights. Although its capability across various challenges is evident, the algorithm's computational efficiency remains an impediment, limiting its scalability potential. To address these limitations, this paper introduces TensorNEAT, a GPU-accelerated library that applies tensorization to the NEAT algorithm. Tensorization reformulates NEAT's diverse network topologies and operations into uniformly shaped tensors, enabling efficient parallel execution across entire populations. TensorNEAT is built upon JAX, leveraging automatic function vectorization and hardware acceleration to significantly enhance computational efficiency. In addition to NEAT, the library supports variants such as CPPN and HyperNEAT, and integrates with benchmark environments like Gym, Brax, and gymnax. Experimental evaluations across various robotic control environments in Brax demonstrate that TensorNEAT delivers up to 500x speedups compared to existing implementations, such as NEAT-Python. The source code for TensorNEAT is publicly available at: https://github.com/EMI-Group/tensorneat.
Chinese Summary: 本文提出TensorNEAT,一个基于JAX的GPU加速库,通过张量化NEAT算法实现种群级并行计算,相比现有实现速度提升高达500倍。
English Summary: The paper introduces TensorNEAT, a GPU-accelerated library built on JAX that tensorizes the NEAT algorithm to achieve up to 500x speedup over existing implementations by enabling parallel execution across populations.

Authors:Shuaiyu Xie, Jian Wang, Yang Luo, Yunqing Yong, Yuzhen Tan, Bing Li
Title: ScalerEval: Automated and Consistent Evaluation Testbed for Auto-scalers in Microservices
Abstract:
Auto-scaling is an automated approach that dynamically provisions resources for microservices to accommodate fluctuating workloads. Despite the introduction of many sophisticated auto-scaling algorithms, evaluating auto-scalers remains time-consuming and labor-intensive, as it requires the implementation of numerous fundamental interfaces, complex manual operations, and in-depth domain knowledge. Besides, frequent human intervention can inevitably introduce operational errors, leading to inconsistencies in the evaluation of different auto-scalers. To address these issues, we present ScalerEval, an end-to-end automated and consistent testbed for auto-scalers in microservices. ScalerEval integrates essential fundamental interfaces for implementation of auto-scalers and further orchestrates a one-click evaluation workflow for researchers. The source code is publicly available at https://github.com/WHU-AISE/ScalerEval.
中文: ScalerEval 是一个端到端的自动化测试平台,通过集成关键接口和实现一键式评估流程,解决了微服务自动扩缩容评估中人工操作繁琐和结果不一致的问题。
English: ScalerEval is an end-to-end automated testbed that streamlines the evaluation of auto-scalers for microservices by integrating essential interfaces and enabling one-click workflows, overcoming the challenges of manual effort and inconsistency.

Authors:Jinghe Yang, Mingming Gong, Ye Pu
Title: Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images
Abstract:
Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LiDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of marine snow. In this paper, we aim to improve the robustness of the feature extraction and matching in the turbid underwater environment using the cross-modal knowledge distillation method that transfers the in-air feature extraction and matching models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, VSLAM, as a representative downstream application of feature extraction and matching, is employed on real underwater sequences to validate the effectiveness of the transferred model. Project page: https://github.com/Jinghe-mel/UFEN-GAN.
中文摘要:本文通过跨模态知识蒸馏方法,利用自适应GAN合成技术生成水下合成图像,将空中特征提取与匹配模型迁移至浑浊水下环境,有效提升了水下特征提取与匹配的鲁棒性。
English Summary: This paper enhances feature extraction and matching in turbid underwater environments through cross-modal knowledge distillation, using synthetic underwater images generated by an adaptive GAN-synthesis method to transfer in-air models to underwater settings.

Authors:Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong
Title: F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
Abstract:
Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detections, achieving superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.
中文: F³Set基准通过提供大规模数据集并引入F³ED新方法,解决了视频中快速、频繁和细粒度事件检测的难题,其性能优于现有技术。
English: The F³Set benchmark addresses the challenge of detecting fast, frequent, and fine-grained events in videos by providing large-scale datasets and introducing F³ED, a novel method that outperforms existing techniques.

Authors:Guangcong Zheng, Teng Li, Xianpan Zhou, Xi Li
Title: RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements
Abstract:
Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, which is critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations, available at https://github.com/ZGCTroy/RealCam-Vid.
中文: 本文指出当前基于静态场景数据集的相机可控视频生成方法存在局限,并推出了首个开源的高分辨率动态场景数据集,具备公制尺度相机标注,以提升复杂环境中物体运动真实性和相机轨迹精确性。
English: This paper highlights the limitations of current camera-controllable video generation methods that rely on static-scene datasets and introduces a new open-source dynamic-scene dataset with metric-scale camera annotations to enable more realistic motion synthesis and precise camera control.

Authors:Eleanor Wallach, Sage Siler, Jing Deng
Title: The More is not the Merrier: Investigating the Effect of Client Size on Federated Learning
Abstract:
Federated Learning (FL) has been introduced as a way to keep data local to clients while training a shared machine learning model, as clients train on their local data and send trained models to a central aggregator. It is expected that FL will have major implications for Mobile Edge Computing (MEC), the Internet of Things, and Cross-Silo FL. In this paper, we focus on the widely used FedAvg algorithm to explore the effect of the number of clients in FL. We find a significant deterioration of learning accuracy for FedAvg as the number of clients increases. To address this issue for a general application, we propose a method called Knowledgeable Client Insertion (KCI) that introduces a very small number of knowledgeable clients to the MEC setting. These knowledgeable clients are expected to have accumulated a large set of data samples to help with training. With the help of KCI, the learning accuracy of FL increases much faster even with a normal FedAvg aggregation technique. We expect this approach to be able to provide great privacy protection for clients against security attacks such as model inversion attacks. Our code is available at https://github.com/Eleanor-W/KCI_for_FL.
中文: 联邦学习(FL)通过本地数据训练共享模型,但常用的FedAvg算法在客户端数量增加时精度显著下降,为此提出的知识型客户端插入(KCI)方法通过引入少量知识型客户端,有效提升了学习速度并增强了隐私保护。
English: Federated Learning (FL) trains a shared model by keeping data local, but the widely used FedAvg algorithm suffers from accuracy decline as client numbers grow, which is addressed by the proposed Knowledgeable Client Insertion (KCI) method that enhances learning speed and privacy protection.
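For readers unfamiliar with the aggregation step this abstract refers to, the following is a minimal sketch of standard FedAvg: the server averages client parameters, weighting each client by its number of local samples. The function name `fedavg` and the toy numbers are assumptions for illustration. The third client, with many more samples, only loosely stands in for a "knowledgeable client"; how KCI actually selects or inserts such clients is not specified in the abstract and is not modelled here.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of client parameter vectors (FedAvg)."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Each client contributes in proportion to its share of the data.
            averaged[i] += w * size / total
    return averaged


# Two small clients that disagree, plus one client holding far more data.
weights = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
sizes = [10, 10, 80]
global_model = fedavg(weights, sizes)
```

With these toy values the aggregated parameters land close to those of the large client, which shows why a few data-rich clients can dominate a sample-weighted average.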

Authors:Danielle Sullivan-Pao, Nicole Tian, Pooya Khorrami
Title: LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution
Abstract:
As generative AI image technologies become more widespread and advanced, there is a growing need for strong attribution models. These models are crucial for verifying the authenticity of images and identifying the architecture of their originating generative models, which is key to maintaining media integrity. However, attribution models struggle to generalize to unseen models, and traditional fine-tuning methods for updating these models have been shown to be impractical in real-world settings. To address these challenges, we propose LoRA eXpandable Networks (LoRAX), a parameter-efficient class incremental algorithm that adapts to novel generative image models without the need for full retraining. Our approach trains an extremely parameter-efficient feature extractor per continual learning task via Low Rank Adaptation. Each task-specific feature extractor learns distinct features while only requiring a small fraction of the parameters present in the underlying feature extractor's backbone model. Our extensive experimentation shows LoRAX outperforms or remains competitive with state-of-the-art class incremental learning algorithms on the Continual Deepfake Detection benchmark across all training scenarios and memory settings, while requiring less than 3% of the number of trainable parameters per feature extractor compared to the full-rank implementation. LoRAX code is available at: https://github.com/mit-ll/lorax_cil.
中文: 针对生成式AI图像溯源模型难以适应新型模型的挑战,我们提出LoRAX参数高效增量学习算法,该方案在持续深度伪造检测任务中表现卓越,且每个特征提取器所需训练参数不足全量训练的3%。
English: To address the limitations of attribution models in adapting to new generative AI image technologies, we introduce LoRAX, a parameter-efficient incremental learning algorithm that excels in continual deepfake detection with minimal trainable parameters.

Authors:Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Title: DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Abstract:
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture- and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models generally allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and comparison to non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
中文: 本研究首次评估了具备推理能力的大语言模型在自然语言生成任务中的表现,发现其相对于非推理模型的优势因架构和任务而异,同时证明模型蒸馏在32B参数规模内仍能保持良好性能。
English: This study pioneers the evaluation of reasoning-enabled large language models for assessing natural language generation tasks, revealing that their performance advantages over non-reasoning models vary by architecture and task while demonstrating that model distillation remains effective down to 32B parameters.

Authors:Lucian Chauvin, Somil Gupta, Angelina Ibarra, Joshua Peeples
Title: Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms
Abstract:
Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there is no method for developing and benchmarking these methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/SARIAD.
中文: SARIAD是一个综合性工具包,集成了合成孔径雷达(SAR)图像异常检测的数据集与算法,为该领域的方法开发、性能评估及可重复研究提供了统一平台。
English: SARIAD is a comprehensive toolkit that integrates datasets and algorithms for developing and benchmarking anomaly detection methods in synthetic aperture radar (SAR) imagery, enabling reproducible research in the field.

Authors:Ingryd V. S. T. Pereira, George D. C. Cavalcanti, Rafael M. O. Cruz
Title: Multi-view autoencoders for Fake News Detection
Abstract:
Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.
Chinese: 本文提出了一种多视图自编码器方法,通过整合多种特征提取技术来提升虚假新闻检测效果,实验表明选择性融合互补文本特征可显著提高分类性能并优化计算效率。
English: This paper proposes a multi-view autoencoder approach that integrates multiple feature extraction techniques to enhance fake news detection, achieving improved classification performance and efficiency by selectively combining complementary textual features.

Authors:Junbang Liu, Enpei Huang, Dongxing Mao, Hui Zhang, Xinyuan Song, Yongxin Ni
Title: ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting
Abstract:
Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce another Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.
中文: 本文提出ContrastiveGaussian方法,通过将对比学习和感知损失融入生成过程,有效利用扩散模型输出的视觉不一致性来提升单图像3D生成质量,实现了更优的纹理保真度和几何一致性。
English: This paper introduces ContrastiveGaussian, a method that enhances 3D generation from single images by integrating contrastive learning and perceptual loss to address visual inconsistencies in diffusion model outputs, achieving superior texture fidelity and geometric consistency.

Authors:Chengyu Yang, Chengjun Liu
Title: Interpretable Automatic Rosacea Detection with Whitened Cosine Similarity
Abstract:
According to the National Rosacea Society, approximately sixteen million Americans suffer from rosacea, a common skin condition that causes flushing or long-term redness on a person's face. To increase rosacea awareness and to better assist physicians in diagnosing this disease, we propose an interpretable automatic rosacea detection method based on whitened cosine similarity in this paper. The contributions of the proposed method are three-fold. First, the proposed method can automatically distinguish patients suffering from rosacea from people who are free of this disease with a significantly higher accuracy than other methods on unseen test data, including both classical deep learning and statistical methods. Second, the proposed method addresses the interpretability issue by measuring the similarity between the test sample and the means of two classes, namely the rosacea class versus the normal class, which allows both medical professionals and patients to understand and trust the results. Finally, the proposed method will not only help increase awareness of rosacea in the general population, but will also help remind patients who suffer from this disease of possible early treatment, as rosacea is more treatable in its early stages. The code and data are available at https://github.com/chengyuyang-njit/ICCRD-2025.
Chinese: 本文提出了一种基于白化余弦相似度的可解释自动红斑痤疮检测方法,其准确率优于现有方法,有助于提升医生诊断能力并增强公众对该疾病的认识。
English: This paper introduces an interpretable automatic rosacea detection method using whitened cosine similarity, achieving higher accuracy than existing approaches and enhancing both physician diagnosis and public awareness of the condition.

Authors:Sushant Gautam, Jingdao Chen
Title: X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization
Abstract:
Restoring severely blurred images remains a significant challenge in computer vision, impacting applications in autonomous driving, medical imaging, and photography. This paper introduces a novel training strategy based on curriculum learning to improve the robustness of deep learning models for extreme image deblurring. Unlike conventional approaches that train on only low to moderate blur levels, our method progressively increases the difficulty by introducing images with higher blur severity over time, allowing the model to adapt incrementally. Additionally, we integrate perceptual and hinge loss during training to enhance fine detail restoration and improve training stability. We experimented with various curriculum learning strategies and explored the impact of the train-test domain gap on the deblurring performance. Experimental results on the Extreme-GoPro dataset showed that our method outperforms the next best method by 14% in SSIM, whereas experiments on the Extreme-KITTI dataset showed that our method outperforms the next best by 18% in SSIM. Ablation studies showed that a linear curriculum progression outperforms step-wise, sigmoid, and exponential progressions, while hyperparameter settings such as the training blur percentage and loss function formulation all play important roles in addressing extreme blur artifacts. Datasets and code are available at https://github.com/RAPTOR-MSSTATE/XDECODE
Chinese: 本文提出一种基于课程学习的训练策略,通过渐进增加模糊程度并结合感知与铰链损失,在极端图像去模糊任务中取得显著突破,在基准数据集上的SSIM指标比现有最佳方法高出14-18%。
English: This paper proposes a curriculum learning-based training strategy that progressively increases blur severity and integrates perceptual and hinge loss, achieving significant improvements in extreme image deblurring with 14-18% higher SSIM scores than existing methods on benchmark datasets.
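The linear curriculum progression mentioned in this abstract can be pictured as a schedule that raises the maximum blur severity by equal increments each epoch. The sketch below shows one such schedule; the function name `linear_blur_schedule`, the severity range, and the epoch count are hypothetical and are not taken from the X-DECODE code.

```python
def linear_blur_schedule(epoch: int, total_epochs: int,
                         min_severity: float, max_severity: float) -> float:
    """Maximum blur severity allowed at `epoch`, growing linearly over training.

    Illustrative assumption: severity is a single scalar (for example a kernel
    size) that starts at `min_severity` and reaches `max_severity` in the
    final epoch.
    """
    if total_epochs < 2:
        return max_severity
    # Clamp the epoch index so out-of-range values stay within the schedule.
    clamped = min(max(epoch, 0), total_epochs - 1)
    progress = clamped / (total_epochs - 1)
    return min_severity + progress * (max_severity - min_severity)


# Severity cap per epoch for a five-epoch run between severities 1.0 and 9.0.
schedule = [linear_blur_schedule(e, 5, 1.0, 9.0) for e in range(5)]
```

A step-wise, sigmoid, or exponential progression would replace only the `progress` term, which makes the alternatives compared in the ablation easy to swap in.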

Authors:Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, David Ha
Title: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Abstract:
AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI-generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.
中文: AI科学家-v2系统首次实现了完全由人工智能生成且通过同行评审的学术论文,标志着人工智能已具备自主开展完整科学研究的能力。
English: The AI Scientist-v2 is an autonomous system that successfully produced the first fully AI-generated peer-review-accepted scientific paper, demonstrating AI's growing capability to conduct end-to-end research without human intervention.

Authors:Tony Shen, Seonghwan Seo, Ross Irwin, Kieran Didi, Simon Olsson, Woo Youn Kim, Martin Ester
Title: Compositional Flows for 3D Molecule and Synthesis Pathway Co-design
Abstract:
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and a 5.8$\times$ improvement in sampling efficiency compared to a 2D synthesis-based baseline. To the best of our knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.38) and AiZynth success rate (62.2\%) on the CrossDocked benchmark.
中文: CGFlow是一种新颖框架,通过扩展流匹配技术实现连续特征组合对象的逐步生成与奖励引导采样,在可合成药物设计中达到顶尖性能并显著提升采样效率。
English: CGFlow is a novel framework that extends flow matching to generate compositional objects with continuous features through step-by-step construction and reward-guided sampling, achieving state-of-the-art performance in synthesizable drug design with significantly improved efficiency.

Authors:Angelina Ibarra, Joshua Peeples
Title: Patch distribution modeling framework adaptive cosine estimator (PaDiM-ACE) for anomaly detection and localization in synthetic aperture radar imagery
Abstract:
This work presents a new approach to anomaly detection and localization in synthetic aperture radar (SAR) imagery, expanding upon the existing patch distribution modeling framework (PaDiM). We introduce the adaptive cosine estimator (ACE) detection statistic. PaDiM uses the Mahalanobis distance at inference, an unbounded metric. ACE instead uses the cosine similarity metric, providing bounded anomaly detection scores. The proposed method is evaluated across multiple SAR datasets, with performance metrics including the area under the receiver operating characteristic curve (AUROC) at the image and pixel level, aiming for increased performance in anomaly detection and localization of SAR imagery. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/PaDiM-ACE.
中文: 本研究提出自适应余弦估计器(ACE),通过用有界的余弦相似度替代PaDiM中无界的马氏距离,增强了合成孔径雷达图像异常检测与定位能力,在多个数据集上通过AUROC指标验证了性能提升。
English: This study introduces an adaptive cosine estimator (ACE) to enhance SAR image anomaly detection and localization by replacing PaDiM's unbounded Mahalanobis distance with bounded cosine similarity, demonstrating improved performance across multiple datasets through AUROC metrics.
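
The contrast between an unbounded distance and a bounded cosine score can be illustrated with a few lines of plain Python. This sketch is not the PaDiM-ACE statistic itself: it assumes a diagonal covariance for the Mahalanobis distance and uses ordinary cosine similarity, purely to show that one score grows with the magnitude of the input while the other stays within [-1, 1]. The function names are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; always in [-1, 1]."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    mean, var = [0.0, 0.0], [1.0, 1.0]
    reference = [1.0, 2.0]
    for scale in (1, 10, 100):
        x = [scale * 1.0, scale * 2.0]
        # The distance scales with the input; the cosine score does not.
        print(scale, mahalanobis_diag(x, mean, var), cosine_similarity(x, reference))
```

A bounded score makes thresholds easier to compare across images and datasets, which is the motivation the abstract gives for replacing the Mahalanobis distance.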

Authors:Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Title: Linguistic Interpretability of Transformer-based Language Models: a systematic review
Abstract:
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
中文: 本综述系统分析了160项研究,通过考察多语言Transformer模型在句法、形态、词汇语义及语篇层面的内部表征,填补了现有可解释性研究聚焦英语模型或忽视语言知识的空白。
English: This survey comprehensively analyzes 160 studies exploring how Transformer-based language models encode linguistic knowledge across syntax, morphology, semantics, and discourse, addressing gaps in interpretability research by examining multilingual models beyond English-specific limitations.

Authors:Nian Wu, Nivetha Jayakumar, Jiarui Xing, Miaomiao Zhang
Title: IGG: Image Generation Informed by Geodesic Dynamics in Deformation Spaces
Abstract:
Generative models have recently gained increasing attention in image generation and editing tasks. However, they often lack a direct connection to object geometry, which is crucial in sensitive domains such as computational anatomy, biology, and robotics. This paper presents a novel framework for Image Generation informed by Geodesic dynamics (IGG) in deformation spaces. Our IGG model comprises two key components: (i) an efficient autoencoder that explicitly learns the geodesic path of image transformations in the latent space; and (ii) a latent geodesic diffusion model that captures the distribution of latent representations of geodesic deformations conditioned on text instructions. By leveraging geodesic paths, our method ensures smooth, topology-preserving, and interpretable deformations, capturing complex variations in image structures while maintaining geometric consistency. We validate the proposed IGG on plant growth data and brain magnetic resonance imaging (MRI). Experimental results show that IGG outperforms the state-of-the-art image generation/editing models with superior performance in generating realistic, high-quality images with preserved object topology and reduced artifacts. Our code is publicly available at https://github.com/nellie689/IGG.
中文摘要:本文提出基于变形空间测地动力学的IGG图像生成框架,通过测地路径实现平滑、保拓扑的图像变换,在植物生长和脑部MRI数据上验证了其优于现有方法的生成质量与几何一致性保持能力。
English Summary: This paper introduces the IGG framework, which uses geodesic dynamics in deformation spaces to generate and edit images with smooth, topology-preserving transformations, validated on plant growth and brain MRI data with superior results over existing methods.

Authors:Biplav Srivastava, Kausik Lakkaraju, Nitin Gupta, Vansh Nagpal, Bharath C. Muppasani, Sara E. Jones
Title: SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness
Abstract:
Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: https://github.com/ai4society/trustworthy-chatbot.
中文: 协作式聊天机器人存在可靠性问题,因此推出了SafeChat架构,通过可追溯来源的响应和自动信任评估,确保在选举和医疗等敏感领域的安全可信应用。
English: Collaborative chatbots like ChatGPT face trust issues due to limitations in explainability and safety, prompting the development of SafeChat—a secure architecture ensuring traceable, source-grounded responses for reliable applications such as elections and healthcare.

Authors:Xuan-Hao Liu, Bao-Liang Lu, Wei-Long Zheng
Title: mixEEG: Enhancing EEG Federated Learning for Cross-subject EEG Classification with Tailored mixup
Abstract:
Cross-subject electroencephalography (EEG) classification poses great challenges due to the diversity of cognitive processes and physiological structures between different subjects. Modern EEG models are based on neural networks, demanding a large amount of data to achieve high performance and generalizability. However, privacy concerns associated with EEG pose significant limitations to data sharing between different hospitals and institutions, resulting in the lack of large datasets for most EEG tasks. Federated learning (FL) enables multiple decentralized clients to collaboratively train a global model without direct communication of raw data, thus preserving privacy. For the first time, we investigate cross-subject EEG classification in the FL setting. In this paper, we propose a simple yet effective framework termed mixEEG. Specifically, we tailor the vanilla mixup considering the unique properties of the EEG modality. mixEEG shares the unlabeled averaged data of the unseen subject rather than simply sharing raw data under the domain adaptation setting, thus better preserving privacy and offering an averaged label as pseudo-label. Extensive experiments are conducted on an epilepsy detection and an emotion recognition dataset. The experimental results demonstrate that our mixEEG enhances the transferability of the global model for cross-subject EEG classification consistently across different datasets and model architectures. Code is published at: https://github.com/XuanhaoLiu/mixEEG.
中文:提出的mixEEG框架通过共享平均未标记数据而非原始数据,在联邦学习中解决了跨被试脑电分类的难题,在不同数据集和架构下增强模型可迁移性的同时有效保护了隐私。
English: The proposed mixEEG framework addresses cross-subject EEG classification challenges in federated learning by sharing averaged unlabeled data instead of raw data, enhancing model transferability while preserving privacy across different datasets and architectures.
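
For readers unfamiliar with the vanilla mixup that mixEEG tailors, the sketch below shows the standard operation: a convex combination of two samples and of their label vectors, with the mixing weight usually drawn from a Beta distribution. It does not reproduce the EEG-specific variants from the paper; the helper names and the `alpha` default are assumptions for illustration.

```python
import random


def sample_lambda(alpha: float = 0.2, rng: random.Random = random) -> float:
    """Draw the mixing weight from Beta(alpha, alpha), as in vanilla mixup."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, y_a, x_b, y_b, lam: float):
    """Blend two samples and their one-hot (or soft) labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)
    lam = sample_lambda(0.2, rng)
    signal_a, label_a = [0.1, 0.4, -0.3], [1.0, 0.0]  # toy EEG segment, class 0
    signal_b, label_b = [0.5, -0.2, 0.0], [0.0, 1.0]  # toy EEG segment, class 1
    print(lam, mixup(signal_a, label_a, signal_b, label_b, lam))
```

Because the mixed label is itself an average, the same arithmetic explains how an averaged label can act as a pseudo-label when only averaged data is shared.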

Authors:Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Title: SEAL: Steerable Reasoning Calibration of Large Language Models for Free
Abstract:
Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these findings, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
中文: SEAL是一种无需训练的方法,通过潜在空间导向向量校准大语言模型的思维链推理过程,在将推理标记减少11.8%至50.4%的同时,实现了最高11%的准确率提升。
English: SEAL is a training-free method that enhances the accuracy and efficiency of large language models by calibrating their chain-of-thought reasoning through latent space steering vectors, achieving up to 11% higher accuracy while reducing reasoning tokens by 11.8% to 50.4%.
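
The two stages described in the abstract (offline extraction of a steering vector, then on-the-fly intervention on hidden representations) can be sketched with toy vectors. The abstract does not specify how the vector is computed, so this example assumes a common difference-of-means construction between execution thoughts and the other thought types; the function names and the strength parameter `alpha` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(execution_states, other_states):
    """Assumed construction: mean(execution) minus mean(reflection/transition)."""
    mu_exec = mean_vector(execution_states)
    mu_other = mean_vector(other_states)
    return [e - o for e, o in zip(mu_exec, mu_other)]


def steer(hidden, vector, alpha: float = 1.0):
    """Representation intervention: shift a hidden state along the steering vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    # Offline stage: toy 2-D hidden states labeled by thought type.
    execution = [[1.0, 1.0], [3.0, 1.0]]
    reflection_or_transition = [[0.0, 0.0], [0.0, 2.0]]
    v = steering_vector(execution, reflection_or_transition)
    # Online stage: nudge a hidden state during decoding.
    print(v, steer([0.5, 0.5], v, alpha=0.5))
```

In a real model the intervention would be applied to the activations of a chosen layer at each decoding step, which is why no parameters need to be retrained.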

Authors:Hamidreza Eivazi, Jendrik-Alexander Tröger, Stefan Wittek, Stefan Hartmann, Andreas Rausch
Title: EquiNO: A Physics-Informed Neural Operator for Multiscale Simulations
Abstract:
Multiscale problems are ubiquitous in physics. Numerical simulations of such problems by solving partial differential equations (PDEs) at high resolution are computationally too expensive for many-query scenarios, e.g., uncertainty quantification, remeshing applications, topology optimization, and so forth. This limitation has motivated the application of data-driven surrogate models, where the microscale computations are substituted with a surrogate, usually acting as a black-box mapping between macroscale quantities. These models offer significant speedups but struggle with incorporating microscale physical constraints, such as the balance of linear momentum and constitutive models. In this contribution, we propose Equilibrium Neural Operator (EquiNO) as a complementary physics-informed PDE surrogate for predicting microscale physics and compare it with variational physics-informed neural and operator networks. Our framework, applicable to the so-called multiscale FE$^2$ computations, introduces the FE-OL approach by integrating the finite element (FE) method with operator learning (OL). We apply the proposed FE-OL approach to quasi-static problems of solid mechanics. The results demonstrate that FE-OL can yield accurate solutions even when confronted with a restricted dataset during model development. Our results show that EquiNO achieves speedup factors exceeding 8000-fold compared to traditional methods and offers an optimal balance between data-driven and physics-based strategies.
Chinese: 本文提出平衡神经算子(EquiNO),这是一种融合有限元方法与算子学习的物理信息代理模型,用于高效预测多尺度问题中的微观物理现象,在有限数据下仍保持精度,同时实现超过8000倍的加速效果。
English: This paper introduces the Equilibrium Neural Operator (EquiNO), a physics-informed surrogate model that integrates the finite element method with operator learning to efficiently predict microscale physics in multiscale problems, achieving over 8000-fold speedup while maintaining accuracy with limited data.

Authors:Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, Ping Luo
Title: PixelFlow: Pixel-Space Generative Models with Flow
Abstract:
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves affordable computation cost in pixel space. It achieves an FID of 1.98 on the 256$\times$256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
中文: PixelFlow提出了一种直接在像素空间操作的图像生成模型系列,无需变分自编码器即可实现端到端训练,并通过高效级联流建模在图像质量和语义控制方面表现出色。
English: PixelFlow introduces a family of image generation models that operate directly in pixel space, eliminating the need for a pre-trained VAE and reaching an FID of 1.98 on 256$\times$256 class-conditional ImageNet through efficient cascade flow modeling.

Authors:Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: MM-IFEngine: Towards Multimodal Instruction Following
Abstract:
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). We have fully open-sourced the datasets (both SFT and DPO), evaluation code and training scripts at https://github.com/SYuan03/MM-IFEngine.
中文: 本研究提出MM-IFEngine框架,通过生成高质量图像-指令数据集(MM-IFInstruct-23k与MM-IFDPO-23k)和MM-IFEval评估基准,显著提升了多模态大模型的指令跟随能力,在多个基准测试中取得突破性进展。
English: This study introduces MM-IFEngine, a pipeline generating high-quality image-instruction data (MM-IFInstruct-23k and MM-IFDPO-23k) and the MM-IFEval benchmark to enhance multimodal instruction-following in MLLMs, achieving significant performance improvements across multiple benchmarks.

Authors:En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao
Title: Perception-R1: Pioneering Perception Policy with Reinforcement Learning
Abstract:
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
中文: 本研究提出Perception-R1强化学习框架,通过优化感知复杂度处理与奖励机制设计,在多类视觉感知任务中实现显著性能提升,为感知策略学习建立了新基准。
English: This study introduces Perception-R1, a scalable reinforcement learning framework that enhances visual perception tasks by addressing perceptual complexity and reward design, achieving significant performance improvements across multiple benchmarks.

Authors:Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou
Title: Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
Abstract:
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
中文: 动态备忘单(DC)是一种轻量级框架,为黑盒语言模型赋予持久记忆能力,使其能在推理时存储和复用解题策略,无需真实标签或人工反馈即可显著提升各类任务的表现。
English: Dynamic Cheatsheet (DC) is a lightweight framework that equips black-box language models with persistent memory, enabling them to store and reuse problem-solving insights at inference time, which substantially enhances performance across various tasks without requiring ground-truth labels or human feedback.
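
To make the Game of 24 result concrete, the sketch below shows the kind of reusable code snippet a memory could hold: a brute-force solver that combines the numbers pairwise with exact rational arithmetic. The abstract does not reproduce the code the model actually stored, so this is a generic solver under that caveat; `solve24` and `_search` are hypothetical names.

```python
from fractions import Fraction
from itertools import combinations


def solve24(numbers, target=24):
    """Return an expression over `numbers` that evaluates to `target`, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value remains; check it against the target exactly.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them with every operation, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea}+{eb})"),
            (a - b, f"({ea}-{eb})"),
            (b - a, f"({eb}-{ea})"),
            (a * b, f"({ea}*{eb})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb}/{ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve24([4, 7, 8, 8]))
    print(solve24([1, 1, 1, 1]))  # no solution exists, so this prints None
```

Once such a snippet is validated, reusing it on later puzzles replaces error-prone mental arithmetic with deterministic search, which is consistent with the jump in success rate the abstract reports.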

Authors:Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali
Title: SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
Abstract:
Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves $1.4-3.0\times$ speedup over vanilla LRM inference while improving accuracy by $0.4-9.0\%$. Compared to speculative decoding without SpecReason, their combination yields an additional $8.8-58.0\%$ latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
中文摘要:SpecReason系统通过使用轻量模型执行中间推理步骤、基础模型进行验证,有效加速大型推理模型,在提升速度的同时提高了准确性。
English Summary: SpecReason is a system that accelerates Large Reasoning Models by using a lightweight model for intermediate reasoning steps and the base model for verification, achieving significant speedup and improved accuracy.
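
The control flow described in the abstract (a lightweight model drafts each intermediate reasoning step, and the base model only assesses it and takes over when needed) can be sketched as a loop over callables. This is an illustrative skeleton, not the SpecReason implementation: the four callables, the acceptance `threshold`, and the scoring scale are assumptions standing in for real model calls.

```python
from typing import Callable, List


def speculative_reasoning(
    question: str,
    draft_step: Callable[[str, List[str]], str],    # lightweight model (assumed interface)
    base_step: Callable[[str, List[str]], str],     # costly base model (assumed interface)
    judge: Callable[[str, List[str], str], float],  # base model's score for a drafted step
    is_final: Callable[[str], bool],                # detects the step containing the answer
    threshold: float = 0.7,
    max_steps: int = 16,
) -> List[str]:
    """Build a reasoning trace, falling back to the base model on rejected drafts."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidate = draft_step(question, steps)
        # Acceptance is judged on the step's usefulness, not on token-level equality.
        if judge(question, steps, candidate) < threshold:
            candidate = base_step(question, steps)
        steps.append(candidate)
        if is_final(candidate):
            break
    return steps


if __name__ == "__main__":
    trace = speculative_reasoning(
        "toy question",
        draft_step=lambda q, s: f"draft-{len(s)}",
        base_step=lambda q, s: f"base-{len(s)}",
        judge=lambda q, s, c: 0.9 if len(s) % 2 == 0 else 0.1,
        is_final=lambda c: c.endswith("-3"),
    )
    print(trace)
```

Because acceptance depends on a semantic score instead of exact token agreement, this loop is complementary to speculative decoding, which could still be used inside each model call.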

Authors:Ben Cheng, Yize Chen
Title: Open Datasets for Grid Modeling and Visualization: An Alberta Power Network Case
Abstract:
In the power and energy industry, multiple entities in grid operational logs are frequently recorded and updated. Thanks to recent advances in IT facilities and smart metering services, a variety of datasets such as system load, generation mix, and grid connection are often publicly available. While these resources are valuable in evaluating the power grid's operational conditions and system resilience, the lack of fine-grained, accurate locational information constrains the usage of current data, which further hinders the development of smart grid and renewables integration. For instance, electricity end users are not aware of nodal generation mix or carbon emissions, while the general public has limited understanding of the effect of demand response or renewables integration if only the whole system's demands and generations are available. In this work, we focus on recovering power grid topology and line flow directions from open public datasets. Taking the Alberta grid as a working example, we start from mapping multi-modal power system datasets to the grid topology integrated with geographical information. By designing a novel optimization-based scheme to recover line flow directions, we are able to analyze and visualize the interactions between generations and demand vectors in an efficient manner. The proposed research is fully open-sourced and highly generalizable, and can help model and visualize grid information, create synthetic datasets, and facilitate analytics and decision-making frameworks for the clean energy transition.
中文摘要:本研究开发了一种基于公开数据的开源方法,通过创新优化方案重建电网拓扑与线路潮流方向,可有效支持电网建模分析并推动清洁能源转型决策。
English Summary: This study develops an open-source method to reconstruct power grid topology and line flow directions from public datasets, enabling enhanced grid modeling and clean energy transition analytics through a novel optimization approach.

Authors:Erin Carson, Xinye Chen
Title: Pychop: Emulating Low-Precision Arithmetic in Numerical Methods and Neural Networks
Abstract:
Motivated by the growing demand for low-precision arithmetic in computational science, we exploit lower-precision emulation in Python -- widely regarded as the dominant programming language for numerical analysis and machine learning. Low-precision training has revolutionized deep learning by enabling more efficient computation and reduced memory and energy consumption while maintaining model fidelity. To better enable numerical experimentation with and exploration of low precision computation, we developed the Pychop library, which supports customizable floating-point formats and a comprehensive set of rounding modes in Python, allowing users to benefit from fast, low-precision emulation in numerous applications. Pychop also introduces interfaces for both PyTorch and JAX, enabling efficient low-precision emulation on GPUs for neural network training and inference with unparalleled flexibility. In this paper, we offer a comprehensive exposition of the design, implementation, validation, and practical application of Pychop, establishing it as a foundational tool for advancing efficient mixed-precision algorithms. Furthermore, we present empirical results on low-precision emulation for image classification and object detection using published datasets, illustrating the sensitivity of the use of low precision and offering valuable insights into its impact. Pychop enables in-depth investigations into the effects of numerical precision, facilitates the development of novel hardware accelerators, and integrates seamlessly into existing deep learning workflows. Software and experimental code are publicly available at https://github.com/inEXASCALE/pychop.
中文: Pychop库在Python中实现了可定制的低精度算术仿真,支持PyTorch和JAX的GPU加速,为神经网络训练和硬件加速器开发提供了高效灵活的解决方案,同时保持模型精度。
English: The Pychop library enables flexible low-precision arithmetic emulation in Python with GPU support for PyTorch and JAX, facilitating efficient neural network training and hardware accelerator development while maintaining model accuracy.
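
The core idea of low-precision emulation, rounding a double-precision value to a format with fewer significand bits, can be shown with the standard library alone. This sketch deliberately does not use or imitate the Pychop API: `chop` is a hypothetical helper that handles only the significand with round-to-nearest-even and ignores exponent range, overflow, and subnormals.

```python
import math


def chop(x: float, sig_bits: int) -> float:
    """Round x to `sig_bits` significand bits (round-to-nearest, ties to even).

    Simplified illustration: exponent range, overflow and subnormals are ignored.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 1 << sig_bits
    # Scaling by a power of two is exact; round() on a float uses ties-to-even.
    rounded = round(mantissa * scale) / scale
    return math.ldexp(rounded, exponent)


if __name__ == "__main__":
    third = 1.0 / 3.0
    print(chop(third, 11))  # 11 significand bits, as in IEEE half precision
    print(chop(0.1, 8))     # 8 significand bits, as in bfloat16
    # Rounding after every operation shows how error accumulates in a running sum.
    total = 0.0
    for _ in range(1000):
        total = chop(total + chop(0.1, 11), 11)
    print(total)
```

Wrapping each arithmetic operation in such a rounding step is the basic mechanism that lets numerical experiments expose the sensitivity to precision that the abstract discusses.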

Authors:Yifan Ding, Arturas Aleksandraus, Amirhossein Ahmadian, Jonas Unger, Fredrik Lindsten, Gabriel Eilertsen
Title: Revisiting Likelihood-Based Out-of-Distribution Detection by Modeling Representations
Abstract:
Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning systems, particularly in safety-critical applications. Likelihood-based deep generative models have historically faced criticism for their unsatisfactory performance in OOD detection, often assigning higher likelihood to OOD data than in-distribution samples when applied to image data. In this work, we demonstrate that likelihood is not inherently flawed. Rather, several properties of the image space prevent likelihood from serving as a valid detection score. Given a sufficiently good likelihood estimator, specifically using the probability flow formulation of a diffusion model, we show that likelihood-based methods can still perform on par with state-of-the-art methods when applied in the representation space of pre-trained encoders. The code of our work can be found at https://github.com/limchaos/Likelihood-OOD.git.
Chinese: 本研究证明,借助扩散模型在预训练编码器的表征空间中使用基于似然的方法,能够取得与最先进方法相当的分布外检测性能,从而挑战了先前对似然方法的批评。
English: This study demonstrates that likelihood-based methods can perform on par with state-of-the-art out-of-distribution detection methods when applied in the representation space of pre-trained encoders using diffusion models, challenging previous criticisms of likelihood approaches.

Authors:Bo Zhang, Hui Ma, Dailin Li, Jian Ding, Jian Wang, Bo Xu, HongFei Lin
Title: Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation
Abstract:
Large language models (LLMs) demonstrate remarkable text comprehension and generation capabilities but often lack the ability to utilize up-to-date or domain-specific knowledge not included in their training data. To address this gap, we introduce KEDiT, an efficient method for fine-tuning LLMs for knowledge-grounded dialogue generation. KEDiT operates in two main phases: first, it employs an information bottleneck to compress retrieved knowledge into learnable parameters, retaining essential information while minimizing computational overhead. Second, a lightweight knowledge-aware adapter integrates these compressed knowledge vectors into the LLM during fine-tuning, updating less than 2% of the model parameters. The experimental results on the Wizard of Wikipedia and a newly constructed PubMed-Dialog dataset demonstrate that KEDiT excels in generating contextually relevant and informative responses, outperforming competitive baselines in automatic, LLM-based, and human evaluations. This approach effectively combines the strengths of pretrained LLMs with the adaptability needed for incorporating dynamic knowledge, presenting a scalable solution for fields such as medicine.
中文: KEDiT通过将检索到的知识压缩为可学习参数,并利用轻量级适配器将其整合到大型语言模型中,以极少的参数更新实现了上下文相关且信息丰富的对话生成。
English: KEDiT efficiently fine-tunes large language models by compressing retrieved knowledge into learnable parameters and integrating them via a lightweight adapter, enabling contextually relevant dialogue generation with minimal parameter updates.

Authors:Yihao Wang, Zhong Qian, Peifeng Li
Title: FMNV: A Dataset of Media-Published News Videos for Fake News Detection
Abstract:
News media, particularly video-based platforms, have become deeply embedded in daily life, concurrently amplifying the risks of misinformation dissemination. Consequently, multimodal fake news detection has garnered significant research attention. However, existing datasets predominantly comprise user-generated videos characterized by crude editing and limited public engagement, whereas professionally crafted fake news videos disseminated by media outlets, often politically or virally motivated, pose substantially greater societal harm. To address this gap, we construct FMNV, a novel dataset exclusively composed of news videos published by media organizations. Through empirical analysis of existing datasets and our curated collection, we categorize fake news videos into four distinct types. Building upon this taxonomy, we employ Large Language Models (LLMs) to automatically generate deceptive content by manipulating authentic media-published news videos. Furthermore, we propose FMNVD, a baseline model featuring a dual-stream architecture that integrates spatio-temporal motion features from a 3D ResNeXt-101 backbone and static visual semantics from CLIP. The two streams are fused via an attention-based mechanism, while co-attention modules refine the visual, textual, and audio features for effective multi-modal aggregation. Comparative experiments demonstrate both the generalization capability of FMNV across multiple baselines and the superior detection efficacy of FMNVD. This work establishes critical benchmarks for detecting high-impact fake news in media ecosystems while advancing methodologies for cross-modal inconsistency analysis. Our dataset is available at https://github.com/DennisIW/FMNV.
中文:本研究提出了由专业制作的假新闻视频构成的新数据集FMNV,并开发了FMNVD双流检测模型,通过融合多模态特征有效应对高影响力虚假信息的社会危害。
English: This research introduces FMNV, a novel dataset of professionally produced fake news videos, and proposes FMNVD, a dual-stream detection model that effectively integrates multimodal features to address the societal harm of high-impact misinformation.

Authors:Anne-Sofie Maerten, Li-Wei Chen, Stefanie De Winter, Christophe Bossens, Johan Wagemans
Title: LAPIS: A novel dataset for personalized image aesthetic assessment
Abstract:
We present the Leuven Art Personalized Image Set (LAPIS), a novel dataset for personalized image aesthetic assessment (PIAA). It is the first dataset with images of artworks that is suitable for PIAA. LAPIS consists of 11,723 images and was meticulously curated in collaboration with art historians. Each image has an aesthetics score and a set of image attributes known to relate to aesthetic appreciation. Besides rich image attributes, LAPIS offers rich personal attributes of each annotator. We implemented two existing state-of-the-art PIAA models and assessed their performance on LAPIS. We assess the contribution of personal attributes and image attributes through ablation studies and find that performance deteriorates when certain personal and image attributes are removed. An analysis of failure cases reveals that both existing models make similar incorrect predictions, highlighting the need for improvements in artistic image aesthetic assessment. The LAPIS project page can be found at: https://github.com/Anne-SofieMaerten/LAPIS
中文: LAPIS是首个适用于个性化图像美学评估的艺术作品数据集,包含11,723张精心策展的图像,消融研究表明个人属性和图像属性对模型性能均至关重要,而现有模型在艺术图像美学评估中仍存在需要改进的系统性误判。
English: The LAPIS dataset is the first personalized image aesthetic assessment collection featuring 11,723 curated artworks, where ablation studies demonstrate that both personal and image attributes are crucial for model performance, while current models show consistent prediction errors requiring further refinement.

Authors:Xiaowu Zhang, Hongfei Zhao, Jingyi Hou, Zhijie Liu
Title: Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design
Abstract:
The Chinese Spelling Correction (CSC) task focuses on detecting and correcting spelling errors in sentences. Current research primarily explores two approaches: traditional multimodal pre-trained models and large language models (LLMs). However, LLMs face limitations in CSC, particularly over-correction, making them suboptimal for this task. While existing studies have investigated the use of phonetic and graphemic information in multimodal CSC models, effectively leveraging these features to enhance correction performance remains a challenge. To address this, we propose the Multimodal Analysis for Character Usage (\textbf{MACU}) experiment, identifying potential improvements for multimodal correction. Based on empirical findings, we introduce \textbf{NamBert}, a novel multimodal model for Chinese spelling correction. Experiments on benchmark datasets demonstrate NamBert's superiority over SOTA methods. We also conduct a comprehensive comparison between NamBert and LLMs, systematically evaluating their strengths and limitations in CSC. Our code and model are available at https://github.com/iioSnail/NamBert.
中文: 本研究提出了NamBert这一新型多模态中文拼写纠错模型,通过有效利用语音和字形特征超越了现有方法,同时系统评估了大语言模型在此任务中的局限性。
English: The study introduces NamBert, a novel multimodal model for Chinese spelling correction that outperforms current methods by effectively leveraging phonetic and graphemic features, while also highlighting limitations of large language models in this task.

Authors:Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao
Title: VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Abstract:
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
中文: DeepSeek R1 基于确定性答案的规则奖励强化学习方法被成功扩展到视觉语言模型VLM-R1中,有效提升了视觉推理任务的性能表现和泛化能力。
English: DeepSeek R1's reinforcement learning approach, using rule-based rewards for tasks with clear answers, is successfully extended to vision-language models through VLM-R1, enhancing both performance and generalization in visual reasoning tasks.
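
As an illustration of what a rule-based reward can look like for a visual task with deterministic ground truth, the sketch below scores a predicted bounding box by its intersection-over-union (IoU) with the annotated box. This is an assumption made for illustration: the abstract does not specify the reward used by VLM-R1, and the function names, the `(x1, y1, x2, y2)` box layout, and the `threshold` parameter are all hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(pred_box, gt_box, threshold=0.5):
    """Hypothetical reward: the IoU itself if it clears the threshold, else 0."""
    score = iou(pred_box, gt_box)
    return score if score >= threshold else 0.0


if __name__ == "__main__":
    print(rule_based_reward((0, 0, 2, 2), (0, 0, 2, 2)))
    print(rule_based_reward((0, 0, 2, 2), (1, 1, 3, 3)))
```

Because the reward is computed from annotations by a fixed rule, no learned reward model is needed, which is the property the abstract highlights.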

Authors:Andrés Bell-Navas, María Villalba-Orero, Enrique Lara-Pezzi, Jesús Garicano-Mena, Soledad Le Clainche
Title: Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases
Abstract:
Heart disease is the leading cause of death worldwide. According to the World Health Organization (WHO), approximately 18 million deaths occur each year due to heart disease. In particular, heart failure (HF) pushes the healthcare industry to develop systems for its early, rapid, and effective prediction. This work presents an automatic system based on a novel deep learning framework that analyses echocardiography video sequences in real time for the challenging and more specific task of heart failure time prediction. The system works in two stages. The first transforms the data from a database of echocardiography video sequences into a machine learning-compatible collection of annotated images that can be used in the training phase of any machine learning-based framework, including a deep learning-based one. This stage uses the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). Self-supervised learning (SSL) methods, so far barely explored in the literature on heart failure prediction, are adopted to effectively train the ViT from scratch, even with scarce databases. The designed neural network analyses images from echocardiography sequences to estimate the time at which heart failure will occur. The results show the efficacy of the HODMD algorithm and the superiority of the proposed system over several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).
中文: 心脏病是全球主要死因,本研究提出一种基于新型深度学习框架的自动化系统,通过包含HODMD算法处理数据和视觉变换器预测的两阶段方法,实时分析超声心动图视频来预测心力衰竭的发生时间。
English: Heart disease is the leading global cause of death, and this study introduces an automated system using a novel deep learning framework that analyzes echocardiography videos in real-time to predict the timing of heart failure through a two-stage process involving HODMD for data processing and a Vision Transformer for prediction.

Authors:Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F. T. Martins, Graham Neubig
Title: Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering
Abstract:
Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more "pragmatic" approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa
中文: 现有机器翻译自动评估指标难以衡量跨句子的意义保留,因此我们提出TREQA框架,通过测试翻译文本对原文关键信息的阅读理解问题回答准确性来评估质量,在复杂领域表现优异且提供可解释性。
English: Current automatic metrics for machine translation evaluation often fail to assess meaning preservation beyond individual sentences, prompting the introduction of TREQA, a pragmatic framework that evaluates translations by testing how well they answer comprehension questions about key information in the source text, showing competitive performance and enhanced interpretability in complex domains.

Authors:Moritz Rempe, Fabian Hörst, Helmut Becker, Marco Schlimbach, Lukas Rotkopf, Kevin Kröninger, Jens Kleesiek
Title: PhaseGen: A Diffusion-Based Approach for Complex-Valued MRI Data Generation
Abstract:
Magnetic resonance imaging (MRI) raw data, or k-Space data, is complex-valued, containing both magnitude and phase information. However, clinical and existing Artificial Intelligence (AI)-based methods focus only on magnitude images, discarding the phase data despite its potential for downstream tasks, such as tumor segmentation and classification. In this work, we introduce PhaseGen, a novel complex-valued diffusion model for generating synthetic MRI raw data conditioned on magnitude images, commonly used in clinical practice. This enables the creation of artificial complex-valued raw data, allowing pretraining for models that require k-Space information. We evaluate PhaseGen on two tasks: skull-stripping directly in k-Space and MRI reconstruction using the publicly available FastMRI dataset. Our results show that training with synthetic phase data significantly improves generalization for skull-stripping on real-world data, with an increased segmentation accuracy from 41.1% to 80.1%, and enhances MRI reconstruction when combined with limited real-world data. This work presents a step forward in utilizing generative AI to bridge the gap between magnitude-based datasets and the complex-valued nature of MRI raw data. This approach allows researchers to leverage the vast amount of available image domain data in combination with the information-rich k-Space data for more accurate and efficient diagnostic tasks. We make our code publicly available at https://github.com/TIO-IKIM/PhaseGen.
中文: 本研究提出PhaseGen这一复数值扩散模型,能从临床常用的磁共振幅度图像生成合成k空间数据,通过弥合图像数据与复数值原始数据之间的鸿沟,显著提升了颅骨剥离和图像重建等任务的性能。
English: This study introduces PhaseGen, a complex-valued diffusion model that generates synthetic MRI k-space data from magnitude images, enhancing tasks like skull-stripping and MRI reconstruction by bridging the gap between clinical image data and the full potential of complex-valued raw data.

Authors:Yuxiang Lin, Jingdong Sun, Zhi-Qi Cheng, Jue Wang, Haomin Liang, Zebang Cheng, Yifei Dong, Jun-Yan He, Xiaojiang Peng, Xian-Sheng Hua
Title: Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
Abstract:
Most existing emotion analysis emphasizes which emotion arises (e.g., happy, sad, angry) but neglects the deeper why. We propose Emotion Interpretation (EI), focusing on causal factors, whether explicit (e.g., observable objects, interpersonal interactions) or implicit (e.g., cultural context, off-screen events), that drive emotional responses. Unlike traditional emotion recognition, EI tasks require reasoning about triggers instead of mere labeling. To facilitate EI research, we present EIBench, a large-scale benchmark encompassing 1,615 basic EI samples and 50 complex EI samples featuring multifaceted emotions. Each instance demands rationale-based explanations rather than straightforward categorization. We further propose a Coarse-to-Fine Self-Ask (CFSA) annotation pipeline, which guides Vision-Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale. Extensive evaluations on open-source and proprietary large language models under four experimental settings reveal consistent performance gaps, especially for more intricate scenarios, underscoring EI's potential to enrich empathetic, context-aware AI applications. Our benchmark and methods are publicly available at: https://github.com/Lum1104/EIBench, offering a foundation for advanced multimodal causal analysis and next-generation affective computing.
Chinese Summary: 本文提出情感解释(EI)新方法,聚焦于挖掘情绪背后的因果驱动因素而非简单分类,并推出大规模基准EIBench,通过基于推理的标注框架推动多模态因果分析与共情AI发展。
English Summary: This paper introduces Emotion Interpretation (EI), a novel approach that focuses on identifying the causal factors behind emotions rather than just classifying them, and presents EIBench, a large-scale benchmark with rationale-based explanations to advance EI research.

Authors:Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinhao Li, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yuhao Dong, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen, Zongyu Lin
Title: Kimi-VL Technical Report
Abstract:
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
Chinese: Kimi-VL是一款高效的开源视觉语言模型,具备先进的多模态推理、长上下文理解和强大智能体能力,在仅激活28亿参数的情况下,于多项挑战性任务中展现出卓越性能。
English: Kimi-VL is an efficient open-source vision-language model with advanced multimodal reasoning, long-context understanding, and strong agent capabilities, achieving competitive performance across various challenging tasks while activating only 2.8B parameters.

Authors:Erdenebileg Batbaatar, Jeonggeol Kim, Yongcheol Kim, Young Yoon
Title: Traversal Learning: A Lossless And Efficient Distributed Learning Framework
Abstract:
In this paper, we introduce Traversal Learning (TL), a novel approach designed to address the problem of decreased quality encountered in popular distributed learning (DL) paradigms such as Federated Learning (FL), Split Learning (SL), and SplitFed Learning (SFL). Traditional FL suffers from an accuracy drop during aggregation due to its averaging function, while SL and SFL face increased loss due to the independent gradient updates on each split network. TL adopts a unique strategy where the model traverses the nodes during forward propagation (FP) and performs backward propagation (BP) on the orchestrator, effectively implementing centralized learning (CL) principles within a distributed environment. The orchestrator is tasked with generating virtual batches and planning the sequential node visits of the model during FP, aligning them with the ordered index of the data within these batches. We conducted experiments on six datasets representing diverse characteristics across various domains. Our evaluation demonstrates that TL is on par with classic CL approaches in terms of accurate inference, thereby offering a viable and robust solution for DL tasks. TL outperformed other DL methods and improved accuracy by 7.85% for independent and identically distributed (IID) datasets, macro F1-score by 1.06% for non-IID datasets, accuracy by 2.60% for text classification, and AUC by 3.88% and 4.54% for medical and financial datasets, respectively. By effectively preserving data privacy while maintaining performance, TL represents a significant advancement in DL methodologies. The implementation of TL is available at https://github.com/neouly-inc/Traversal-Learning
中文: 遍历学习(TL)是一种新型分布式学习方法,通过实施集中式学习原则克服了现有范式的性能下降问题,在多种数据集上实现了卓越的准确性和隐私保护。
English: Traversal Learning (TL) is a novel distributed learning method that overcomes performance declines in existing paradigms by implementing centralized learning principles, achieving superior accuracy and privacy across diverse datasets.

Authors:Hengrun Zhao, Yunzhi Zhuge, Yifan Wang, Lijun Wang, Huchuan Lu, Yu Zeng
Title: Learning Universal Features for Generalizable Image Forgery Localization
Abstract:
In recent years, advanced image editing and generation methods have rapidly evolved, making detecting and locating forged image content increasingly challenging. Most existing image forgery detection methods rely on identifying the edited traces left in the image. However, because the traces of different forgeries are distinct, these methods can identify familiar forgeries included in the training data but struggle to handle unseen ones. In response, we present an approach for Generalizable Image Forgery Localization (GIFL). Once trained, our model can detect both seen and unseen forgeries, providing a more practical and efficient solution to counter false information in the era of generative AI. Our method focuses on learning general features from the pristine content rather than traces of specific forgeries, which are relatively consistent across different types of forgeries and therefore can be used as universal features to locate unseen forgeries. Additionally, as existing image forgery datasets are still dominated by traditional hand-crafted forgeries, we construct a new dataset consisting of images edited by various popular deep generative image editing methods to further encourage research in detecting images manipulated by deep generative models. Extensive experimental results show that the proposed approach outperforms state-of-the-art methods in the detection of unseen forgeries and also demonstrates competitive results for seen forgeries. The code and dataset are available at https://github.com/ZhaoHengrun/GIFL.
中文摘要:提出的通用图像伪造定位方法通过从原始图像内容中学习通用特征,有效检测已知和未知伪造类型,并借助新构建的数据集和广泛实验验证了其超越现有方法的性能。
English Summary: The proposed Generalizable Image Forgery Localization (GIFL) method learns universal features from pristine image content to effectively detect both seen and unseen forgeries, outperforming existing approaches through a new dataset and extensive experiments.

Authors:Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun
Title: How Can Objects Help Video-Language Understanding?
Abstract:
Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? At one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly modeled. At the other extreme, image captions by themselves provide strong empirical performance on understanding tasks, despite missing fine-grained spatiotemporal information. To answer this question, we introduce ObjectMLLM, a framework capable of leveraging arbitrary computer vision algorithms to extract and integrate structured visual representations. Through extensive evaluations on six video question answering benchmarks, we confirm that explicit integration of object-centric representations remains necessary. Surprisingly, we observe that the simple approach of quantizing the continuous, structured object information and representing it as plain text performs the best, offering a data-efficient approach to integrate other visual perception modules into MLLM design. Our code and models are released at https://github.com/brown-palm/ObjectMLLM.
Chinese: ObjectMLLM 证实了在多模态大语言模型中显式的以对象为中心的表示仍然必要,将结构化对象数据量化为纯文本的方法被证明是整合视觉感知模块的最有效方式。
English: ObjectMLLM demonstrates that explicit object-centric representation remains essential in multimodal large language models, with quantizing structured object data into plain text proving most effective for integrating visual perception modules.

Authors:Anzhen Li, Shufan Qing, Xiaochang Li, Rui Mao, Mingchen Feng
Title: Probability Estimation and Scheduling Optimization for Battery Swap Stations via LRU-Enhanced Genetic Algorithm and Dual-Factor Decision System
Abstract:
To address the challenges of limited Battery Swap Station datasets, high operational costs, and fluctuating user charging demand, this research proposes a probability estimation model based on charging pile data and constructs nine scenario-specific battery swap demand datasets. In addition, this study combines the Least Recently Used (LRU) strategy with a Genetic Algorithm and incorporates a guided search mechanism, which effectively enhances the global optimization capability. On this basis, a charging schedule optimization system built on dual-factor decision-making is constructed. Experimental results show that the constructed datasets exhibit stable trend characteristics, adhering to 24-hour and 168-hour periodicity patterns, with outlier ratios consistently below 3.26%, confirming data validity. Compared to the baseline, the improved algorithm finds individuals with better fitness in 80% of test regions under the same number of iterations. When benchmarked against an immediate swap-and-charge strategy, our algorithm achieves a peak cost reduction of 13.96%. Moreover, peak user satisfaction reaches 98.57%, while the average iteration time remains below 0.6 seconds, demonstrating good computational efficiency. The complete datasets and optimization algorithm are open-sourced at https://github.com/qingshufan/GA-EVLRU.
中文摘要:本研究针对换电站数据稀缺和运营成本高等问题,构建了基于充电桩数据的概率估计模型与九种场景数据集,通过改进遗传算法实现峰值成本降低13.96%,用户满意度达98.57%,且平均迭代时间保持在0.6秒内。
English Summary: This study builds a probability estimation model from charging pile data and nine scenario-specific battery swap demand datasets, and combines a Genetic Algorithm with the LRU strategy to achieve a peak cost reduction of 13.96% and peak user satisfaction of 98.57% while keeping the average iteration time below 0.6 seconds.

Authors:Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein
Title: LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices $A$ as random projections and sparsifies the matrices $B$ using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in adapter merging by leveraging the orthogonality between adapter subspaces, and supports continual learning by using sparsity to mitigate catastrophic forgetting. Extensive experiments across natural language understanding, mathematical reasoning, code generation, and safety alignment tasks demonstrate that LoRI outperforms full fine-tuning and existing PEFT methods, while using up to 95% fewer trainable parameters than LoRA. In multi-task experiments, LoRI enables effective adapter merging and continual learning with reduced cross-task interference. Code is available at: https://github.com/juzhengz/LoRI
中文: LoRI是一种改进的微调方法,通过冻结随机投影矩阵并应用任务特定稀疏化,在保持优异性能的同时大幅减少可训练参数,相比LoRA减少高达95%参数且有效降低跨任务干扰。
English: LoRI is an enhanced fine-tuning method that freezes random projection matrices and applies task-specific sparsity to significantly reduce trainable parameters while outperforming full fine-tuning and existing PEFT methods, with up to 95% fewer parameters than LoRA and reduced cross-task interference.
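
The structure described in the abstract, a frozen random projection $A$ and a sparsely masked, trainable $B$, can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the helper names are hypothetical, the mask here is drawn at random with a fixed `keep_ratio` (the abstract does not say how task-specific masks are chosen), and no training loop is shown.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of X (n x k) and Y (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def make_adapter(d_out, d_in, rank, keep_ratio, seed=0):
    """Build a toy adapter: frozen random A, zero-initialised B, binary mask."""
    rng = random.Random(seed)
    # A (rank x d_in): random projection, never updated.
    A = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(rank)]
    # B (d_out x rank): the only trainable matrix, starts at zero.
    B = [[0.0] * rank for _ in range(d_out)]
    # Assumption: a random mask stands in for the task-specific mask.
    positions = [(i, j) for i in range(d_out) for j in range(rank)]
    kept = set(rng.sample(positions, max(1, int(keep_ratio * len(positions)))))
    mask = [[1.0 if (i, j) in kept else 0.0 for j in range(rank)]
            for i in range(d_out)]
    return A, B, mask


def delta_w(A, B, mask):
    """Weight update (B * mask) @ A; masked-out entries of B never contribute."""
    masked_B = [[b * m for b, m in zip(b_row, m_row)]
                for b_row, m_row in zip(B, mask)]
    return matmul(masked_B, A)


if __name__ == "__main__":
    A, B, mask = make_adapter(d_out=8, d_in=6, rank=4, keep_ratio=0.1)
    trainable = int(sum(sum(row) for row in mask))
    print("trainable entries:", trainable, "of", 8 * 4 + 4 * 6, "in plain LoRA")
```

Only the unmasked entries of `B` would receive gradient updates, which is where the reduction in trainable parameters relative to plain LoRA comes from.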

Authors:Yixin Cao, Jiahao Ying, Yaoning Wang, Xipeng Qiu, Xuanjing Huang, Yugang Jiang
Title: Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Abstract:
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. One core challenge of evaluation in the LLM era is the generalization issue: how to infer a model's near-unbounded abilities from inevitably bounded benchmarks. We address this challenge by proposing the Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores. MUI quantifies the effort a model expends on a task, defined as the proportion of activated neurons or features during inference. Intuitively, a truly capable model should achieve higher performance with lower effort. Extensive experiments across popular LLMs reveal a consistent inverse logarithmic relationship between MUI and performance, which we formulate as the Utility Law. From this law we derive four practical corollaries that (i) guide training diagnostics, (ii) expose data contamination issues, (iii) enable fairer model comparisons, and (iv) design model-specific dataset diversity. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.
中文: 本文提出模型利用指数(MUI),通过量化推理过程中激活神经元比例来评估大语言模型效率,发现其与性能呈反比对数关系,并为模型评估与优化提供了四项实用推论。
English: This paper introduces the Model Utilization Index (MUI), an interpretable metric that measures the proportion of activated neurons during inference to assess LLMs' efficiency, revealing an inverse logarithmic relationship with performance and offering practical applications for model evaluation and improvement.
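
As a rough illustration of the definition above (the proportion of neurons activated during inference), the sketch below counts a neuron as used if its activation exceeds a threshold on at least one sample. The threshold rule and the function name are assumptions; the abstract does not specify how activation is decided.

```python
def model_utilization_index(activations, threshold=0.0):
    """Fraction of neurons that fire on at least one sample.

    activations: list of per-sample activation vectors of equal length.
    threshold: a neuron counts as activated when its value exceeds this
               (an assumed criterion, chosen only for illustration).
    """
    if not activations:
        raise ValueError("need at least one activation vector")
    n_neurons = len(activations[0])
    used = set()
    for vector in activations:
        for index, value in enumerate(vector):
            if value > threshold:
                used.add(index)
    return len(used) / n_neurons


# Two samples over four neurons; neurons 0 and 1 fire at least once.
samples = [[0.0, 1.2, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0]]
print(model_utilization_index(samples))
```

Under this reading, a model that reaches the same benchmark score while activating a smaller fraction of its neurons would have a lower MUI.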

Authors:Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, Jiaxin Mao
Title: LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking
Abstract:
Using large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of LLM-based reranking. Reranking with LLMs also applies to many real-world settings, such as search engines and retrieval-augmented generation. In response to the growing demand from both research and practice, we introduce a unified framework, \textbf{LLM4Ranking}, which enables users to adopt different ranking methods using open-source or closed-source API-based LLMs. Our framework provides a simple and extensible interface for document reranking with LLMs, as well as easy-to-use evaluation and fine-tuning scripts for this task. We conducted experiments based on this framework and evaluated various models and methods on several widely used datasets, providing reproducible results on utilizing LLMs for document reranking. Our code is publicly available at https://github.com/liuqi6777/llm4ranking.
中文: LLM4Ranking框架为使用大语言模型进行文档重排提供了一个统一且可扩展的接口,使用户能够通过公开代码评估和微调多种模型与方法。
English: The LLM4Ranking framework provides a unified and extensible interface for document reranking using large language models, enabling users to evaluate and fine-tune various models and methods with publicly available code.

Authors:Chenxi Sun, Hongzhi Zhang, Qi Wang, Fuzheng Zhang
Title: Routing to the Right Expertise: A Trustworthy Judge for Instruction-based Image Editing
Abstract:
Instruction-based Image Editing (IIE) models have made significant improvements due to the progress of multimodal large language models (MLLMs) and diffusion models, which can understand and reason about complex editing instructions. In addition to advancing current IIE models, accurately evaluating their output has become increasingly critical and challenging. Current IIE evaluation methods and procedures often fall short of aligning with human judgment and lack explainability. To address these limitations, we propose JUdgement through Routing of Expertise (JURE). Each expert in JURE is a pre-selected model assumed to be equipped with an atomic expertise that can provide useful feedback for judging output, and the router dynamically routes the evaluation task of a given instruction and its output to appropriate experts, aggregating their feedback into a final judgment. JURE is trustworthy in two aspects. First, it can effortlessly provide explanations for its judgments by examining the routed experts and their feedback. Second, experimental results demonstrate that JURE is reliable by achieving superior alignment with human judgments, setting a new standard for automated IIE evaluation. Moreover, JURE's flexible design is future-proof: modular experts can be seamlessly replaced or expanded to accommodate advancements in IIE, maintaining consistently high evaluation quality. Our evaluation data and results are available at https://github.com/Cyyyyyrus/JURE.git.
中文: JURE通过模块化专家路由系统,显著提升了基于指令的图像编辑自动评估能力,其可解释的判断结果与人类评估高度一致,且能灵活适应未来发展。
English: JURE introduces a modular expert-routing system that significantly improves automated evaluation of instruction-based image editing by providing explainable judgments that better align with human feedback and can flexibly adapt to future advancements.

Authors:Anning Hu, Ang Li, Xirui Jin, Danping Zou
Title: ThermoStereoRT: Thermal Stereo Matching in Real Time via Knowledge Distillation and Attention-based Refinement
Abstract:
We introduce ThermoStereoRT, a real-time thermal stereo matching method designed for all-weather conditions that recovers disparity from two rectified thermal stereo images, envisioning applications such as night-time drone surveillance or under-bed cleaning robots. Leveraging a lightweight yet powerful backbone, ThermoStereoRT constructs a 3D cost volume from thermal images and employs multi-scale attention mechanisms to produce an initial disparity map. To refine this map, we design a novel channel and spatial attention module. Addressing the challenge of sparse ground truth data in thermal imagery, we utilize knowledge distillation to boost performance without increasing computational demands. Comprehensive evaluations on multiple datasets demonstrate that ThermoStereoRT delivers both real-time capacity and robust accuracy, making it a promising solution for real-world deployment in various challenging environments. Our code will be released on https://github.com/SJTU-ViSYS-team/ThermoStereoRT
中文: ThermoStereoRT是一种实时热成像立体匹配方法,采用轻量级架构和多尺度注意力机制生成精确视差图,并通过知识蒸馏增强性能,适用于夜间无人机监控等全天候场景。
English: ThermoStereoRT is a real-time thermal stereo matching method that uses a lightweight backbone and multi-scale attention to generate accurate disparity maps, enhanced by knowledge distillation for robust performance in all-weather applications like drone surveillance.
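
To make the notion of a cost volume concrete, here is a classical, non-learned stand-in for a single image row: an absolute-difference cost for each candidate disparity, followed by a winner-take-all choice. This is not the network described in the abstract, only the textbook construction it builds on; the function names are hypothetical, and stacking the per-row slices over all rows yields the 3D volume.

```python
def cost_volume(left_row, right_row, max_disp):
    """Matching cost for one rectified row, indexed as volume[d][x].

    Pixel x in the left row is compared with pixel x - d in the right row;
    positions where x - d falls outside the image get infinite cost.
    """
    inf = float("inf")
    width = len(left_row)
    return [
        [abs(left_row[x] - right_row[x - d]) if x - d >= 0 else inf
         for x in range(width)]
        for d in range(max_disp + 1)
    ]


def winner_take_all(volume):
    """Pick, for every pixel, the disparity with the lowest cost."""
    width = len(volume[0])
    return [min(range(len(volume)), key=lambda d: volume[d][x])
            for x in range(width)]


# The right row is the left row shifted by two pixels.
left = [0, 0, 0, 0, 1, 5, 2, 0]
right = [0, 0, 1, 5, 2, 0, 0, 0]
disparity = winner_take_all(cost_volume(left, right, max_disp=3))
print(disparity[4:7])
```

A learned method replaces the raw intensity difference with feature comparisons and the hard argmin with a regression over the volume, but the indexing by disparity stays the same.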

Authors:Dongqi Fu, Yada Zhu, Zhining Liu, Lecheng Zheng, Xiao Lin, Zihao Li, Liri Fang, Katherine Tieu, Onkar Bhardwaj, Kommy Weldemariam, Hanghang Tong, Hendrik Hamann, Jingrui He
Title: ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method
Abstract:
Climate science studies the structure and dynamics of Earth's climate system and seeks to understand how climate changes over time. The data is usually stored as time series that record climate features, geolocation, time attributes, and so on. Recently, climate benchmarks have attracted considerable research attention. In addition to the most common task of weather forecasting, several pioneering benchmarks extend the modality, covering domain-specific applications such as tropical cyclone intensity prediction and flash flood damage estimation, or climate statements and confidence levels expressed in natural language. To further motivate the development of artificial general intelligence for climate science, in this paper, we first contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns (1) the time series climate data from ERA5, (2) extreme weather events data from NOAA, and (3) satellite image data from NASA HLS based on a unified spatial-temporal granularity. Second, under each data modality, we also propose a simple but strong generative method that produces competitive performance in the weather forecasting, thunderstorm alert, and crop segmentation tasks of the proposed ClimateBench-M. The data and code of ClimateBench-M are publicly available at https://github.com/iDEA-iSAIL-Lab-UIUC/ClimateBench-M.
中文: 本文提出多模态气候基准ClimateBench-M,在统一的时空粒度下对齐ERA5时间序列数据、NOAA极端天气事件数据和NASA HLS卫星图像数据,并针对每种模态提出了一种简单而有效的生成式方法。
English: This paper introduces ClimateBench-M, a multi-modal climate benchmark that aligns ERA5 time series, NOAA extreme weather events, and NASA HLS satellite imagery at a unified spatial-temporal granularity, and proposes a simple generative method for each modality.

Authors:Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, Peter Peer
Title: ID-Booth: Identity-consistent Face Generation with Diffusion Models
Abstract:
Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.
中文: 本文提出ID-Booth这一新型扩散框架,通过三重身份训练目标增强合成图像的身份一致性,在保持预训练模型生成能力的同时,实现了比现有方法更优的身份保持与图像多样性。
English: The paper introduces ID-Booth, a novel diffusion-based framework that enhances identity consistency in synthetic image generation through a triplet identity training objective, achieving superior identity preservation and image diversity compared to existing methods.

Authors:Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal
Title: Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
Abstract:
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights into low-bit weights without costly retraining, but can degrade its downstream performance, especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regimes. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
中文: TaCQ是一种新颖的混合精度训练后量化方法,通过保留任务关键权重为16位精度同时量化其余权重,在低比特设置下以最小内存开销实现卓越的性能恢复。
English: TaCQ is a novel mixed-precision post-training quantization method that preserves task-critical weights at 16-bit precision while quantizing others, achieving superior performance recovery in low-bit settings with minimal memory overhead.
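
The selection idea described in the abstract (estimate the weight change caused by quantization, then use gradients to predict its effect on the task) can be read as a first-order saliency score. The sketch below uses |gradient x (quantized weight - original weight)| as that score and keeps the top-scoring weights unquantized. The exact scoring formula, the min-max uniform quantizer, and all function names are assumptions for illustration, not the authors' implementation.

```python
def uniform_quantize(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def saliency(weights, grads, bits):
    """First-order estimate of the loss change caused by quantizing each weight."""
    quantized = uniform_quantize(weights, bits)
    return [abs(g * (q - w)) for w, g, q in zip(weights, grads, quantized)]


def mixed_precision(weights, grads, bits, keep_fraction):
    """Keep the most salient weights at full precision, quantize the rest."""
    scores = saliency(weights, grads, bits)
    k = int(len(weights) * keep_fraction)
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    quantized = uniform_quantize(weights, bits)
    mixed = [w if i in keep else quantized[i] for i, w in enumerate(weights)]
    return mixed, keep
```

A weight with a large gradient but almost no rounding error scores low here, as does a weight with a large rounding error but a near-zero gradient; only weights where both are large are protected.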

Authors:Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang
Title: Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction
Abstract:
Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. They lack awareness of multimodal environmental information from both 2D and 3D observations, which hinders further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input, encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompts. In addition, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories, comparing favorably with state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models will be released at https://github.com/IRMVLab/MMTwin.
中文: 本文提出MMTwin新型扩散模型,通过融合二维RGB图像、三维点云、历史手部路径及文本提示等多模态输入,并集成自我运动与手部轨迹预测双扩散机制,在实验中展现出卓越的三维手部轨迹预测性能及环境泛化能力。
English: This paper introduces MMTwin, a novel diffusion model that leverages multimodal inputs including 2D RGB images, 3D point clouds, past hand waypoints, and text prompts to enhance 3D hand trajectory prediction by concurrently modeling camera egomotion and future hand movements, demonstrating superior performance and generalization in experiments.

Authors:Zhe Wang, Yuhua Ru, Aladine Chetouani, Fang Chen, Fabian Bauer, Liping Zhang, Didier Hans, Rachid Jennane, Mohamed Jarraya, Yung Hsin Chen
Title: MoEDiff-SR: Mixture of Experts-Guided Diffusion Model for Region-Adaptive MRI Super-Resolution
Abstract:
Magnetic Resonance Imaging (MRI) at lower field strengths (e.g., 3T) suffers from limited spatial resolution, making it challenging to capture fine anatomical details essential for clinical diagnosis and neuroimaging research. To overcome this limitation, we propose MoEDiff-SR, a Mixture of Experts (MoE)-guided diffusion model for region-adaptive MRI Super-Resolution (SR). Unlike conventional diffusion-based SR models that apply a uniform denoising process across the entire image, MoEDiff-SR dynamically selects specialized denoising experts at a fine-grained token level, ensuring region-specific adaptation and enhanced SR performance. Specifically, our approach first employs a Transformer-based feature extractor to compute multi-scale patch embeddings, capturing both global structural information and local texture details. The extracted feature embeddings are then fed into an MoE gating network, which assigns adaptive weights to multiple diffusion-based denoisers, each specializing in different brain MRI characteristics, such as centrum semiovale, sulcal and gyral cortex, and grey-white matter junction. The final output is produced by aggregating the denoised results from these specialized experts according to dynamically assigned gating probabilities. Experimental results demonstrate that MoEDiff-SR outperforms existing state-of-the-art methods in terms of quantitative image quality metrics, perceptual fidelity, and computational efficiency. Difference maps from each expert further highlight their distinct specializations, confirming the effective region-specific denoising capability and the interpretability of expert contributions. Additionally, clinical evaluation validates its superior diagnostic capability in identifying subtle pathological features, emphasizing its practical relevance in clinical neuroimaging. Our code is available at https://github.com/ZWang78/MoEDiff-SR.
中文: MoEDiff-SR是一种新型扩散模型,通过混合专家机制动态选择针对不同脑区的专用去噪器,在定量指标和临床诊断方面显著提升了磁共振图像超分辨率性能。
English: MoEDiff-SR is a novel diffusion model that uses a Mixture of Experts to dynamically apply specialized denoising for different brain regions, significantly improving MRI super-resolution performance in both quantitative metrics and clinical diagnosis.
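
The final aggregation step described in the abstract, combining expert outputs according to gating probabilities, follows the standard mixture-of-experts pattern. The sketch below shows it for one token with plain lists; the softmax gate and the function names are generic assumptions rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(expert_outputs, gate_logits):
    """Weighted sum of per-expert outputs for a single token.

    expert_outputs: one output vector per expert, all of equal length.
    gate_logits: one unnormalized score per expert for this token.
    """
    weights = softmax(gate_logits)
    length = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(length)
    ]


# Equal gate scores average the two experts' outputs.
print(aggregate([[1.0, 3.0], [3.0, 5.0]], gate_logits=[0.0, 0.0]))
```

Because the gate is evaluated per token, different image regions can end up dominated by different experts, which is the region-adaptive behavior the abstract describes.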

Authors:Donghao Ren, Fred Hohman, Dominik Moritz
Title: A Scalable Approach to Clustering Embedding Projections
Abstract:
Interactive visualization of embedding projections is a useful technique for understanding data and evaluating machine learning models. Labeling data within these visualizations is critical for interpretation, as labels provide an overview of the projection and guide user navigation. However, most methods for producing labels require clustering the points, which can be computationally expensive as the number of points grows. In this paper, we describe an efficient clustering approach using kernel density estimation in the projected 2D space instead of points. This algorithm can produce high-quality cluster regions from a 2D density map in a few hundred milliseconds, orders of magnitude faster than current approaches. We contribute the design of the algorithm, benchmarks, and applications that demonstrate the utility of the algorithm, including labeling and summarization.
中文: 本文提出了一种在二维投影空间中使用核密度估计的高效聚类方法,能够快速生成高质量聚类区域用于交互式可视化标注,其速度显著优于现有方法。
English: This paper introduces an efficient clustering method using kernel density estimation in 2D projection space, enabling rapid generation of high-quality cluster regions for interactive visualization labeling and significantly outperforming existing approaches in speed.
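
One simple way to turn a 2D density map into cluster regions is to threshold the map and label connected components. The abstract does not describe the authors' algorithm in detail, so the sketch below is only an assumed baseline: it bins points normalized to [0, 1) into a grid, smooths with a box kernel in place of a Gaussian, and flood-fills the cells above a threshold. All names and parameters are hypothetical.

```python
from collections import deque


def density_grid(points, size, radius=1):
    """Bin points in [0, 1)^2 into a size x size grid, then box-smooth it."""
    counts = [[0.0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        counts[row][col] += 1.0
    smooth = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            for rr in range(max(0, r - radius), min(size, r + radius + 1)):
                for cc in range(max(0, c - radius), min(size, c + radius + 1)):
                    smooth[r][c] += counts[rr][cc]
    return smooth


def cluster_regions(density, threshold):
    """Label 4-connected regions whose density is at least the threshold."""
    size = len(density)
    labels = [[0] * size for _ in range(size)]
    n_regions = 0
    for r in range(size):
        for c in range(size):
            if density[r][c] < threshold or labels[r][c]:
                continue
            n_regions += 1
            labels[r][c] = n_regions
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < size and 0 <= nc < size
                            and not labels[nr][nc]
                            and density[nr][nc] >= threshold):
                        labels[nr][nc] = n_regions
                        queue.append((nr, nc))
    return labels, n_regions
```

The labeling cost depends on the grid resolution rather than on the number of points, which is the property that makes density-map approaches attractive for large projections.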

Authors:Yousra Fettach, Adil Bahaj, Mounir Ghogho
Title: Skill Demand Forecasting Using Temporal Knowledge Graph Embeddings
Abstract:
Rapid technological advancements pose a significant threat to a large portion of the global workforce, potentially leaving them behind. In today's economy, there is a stark contrast between the high demand for skilled labour and the limited employment opportunities available to those who are not adequately prepared for the digital economy. To address this critical juncture and gain a deeper and more rapid understanding of labour market dynamics, in this paper, we approach the problem of skill need forecasting as a knowledge graph (KG) completion task, specifically, temporal link prediction. We introduce our novel temporal KG constructed from online job advertisements. We then train and evaluate different temporal KG embeddings for temporal link prediction. Finally, we present predictions of demand for a selection of skills practiced by workers in the information technology industry. The code and the data are available on our GitHub repository https://github.com/team611/JobEd.
中文: 本文通过构建时序知识图谱并完成链接预测任务,来应对技术进步对劳动力构成的威胁,旨在预测信息技术行业未来技能需求,相关代码与数据已开源。
English: This paper addresses the threat of technological advancements to the workforce by using temporal knowledge graph completion to forecast skill demands, specifically predicting future needs for IT skills based on online job ads.

Authors:Mingxuan Li, Hanchen Li, Chenhao Tan
Title: HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Abstract:
Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine the LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
中文: HypoEval是一种创新框架,通过少量人工评估生成详细评分标准并采用清单式方法整合维度得分,仅用30个样本即可实现与人类评估的最佳对齐,同时提供可解释的自动化评测。
English: HypoEval is a novel framework that enhances LLM-based evaluation by generating detailed rubrics from minimal human input and using a checklist approach to combine dimension scores, achieving superior alignment with human judgments using only 30 samples while providing interpretable reasoning.
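
The evaluation pipeline in the abstract has two measurable pieces: combining per-dimension scores into an overall score, and checking alignment with human judgments via Spearman and Pearson correlation. The sketch below assumes the overall score is a plain mean of dimension scores, which the abstract does not confirm; the correlation functions follow their standard definitions, with average ranks for ties.

```python
def overall_score(dimension_scores):
    """Combine checklist scores into one number (plain mean, an assumption)."""
    return sum(dimension_scores.values()) / len(dimension_scores)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Spearman only checks whether the automated scores order the outputs the way humans do, while Pearson also rewards matching the spacing between scores, which is why the abstract reports both.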

Authors:Nuren Zhaksylyk, Ibrahim Almakky, Jay Paranjape, S. Swaroop Vedula, Shameema Sikder, Vishal M. Patel, Mohammad Yaqub
Title: RP-SAM2: Refining Point Prompts for Stable Surgical Instrument Segmentation
Abstract:
Accurate surgical instrument segmentation is essential in cataract surgery for tasks such as skill assessment and workflow optimization. However, limited annotated data makes it difficult to develop fully automatic models. Prompt-based methods like SAM2 offer flexibility yet remain highly sensitive to the point prompt placement, often leading to inconsistent segmentations. We address this issue by introducing RP-SAM2, which incorporates a novel shift block and a compound loss function to stabilize point prompts. Our approach reduces annotator reliance on precise point positioning while maintaining robust segmentation capabilities. Experiments on the Cataract1k dataset demonstrate that RP-SAM2 improves segmentation accuracy, with a 2% mDSC gain, a 21.36% reduction in mHD95, and decreased variance across random single-point prompt results compared to SAM2. Additionally, on the CaDIS dataset, pseudo masks generated by RP-SAM2 for fine-tuning SAM2's mask decoder outperformed those generated by SAM2. These results highlight RP-SAM2 as a practical, stable and reliable solution for semi-automatic instrument segmentation in data-constrained medical settings. The code is available at https://github.com/BioMedIA-MBZUAI/RP-SAM2.
中文: RP-SAM2通过引入移位模块和复合损失函数来稳定点提示,在医疗数据有限的情况下显著提升了手术器械分割的准确性和稳定性,实验结果表明其性能优于SAM2模型。
English: RP-SAM2 enhances surgical instrument segmentation by stabilizing point prompts with a shift block and compound loss, achieving improved accuracy and reduced variance on medical datasets without requiring precisely placed point prompts.
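
For readers unfamiliar with the mDSC figure quoted above, the Dice similarity coefficient for one class is 2|P ∩ T| / (|P| + |T|), and mDSC averages it over classes. The sketch below computes both on flattened label maps; scoring an empty prediction against an empty target as 1.0 is a common convention assumed here, not something the abstract states.

```python
def dice(pred_mask, target_mask):
    """Dice coefficient between two binary masks given as flat sequences."""
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """Average the per-class Dice scores over the given class ids."""
    scores = [
        dice([p == c for p in pred_labels], [t == c for t in target_labels])
        for c in classes
    ]
    return sum(scores) / len(scores)


# Flattened label maps with background 0 and two instrument classes.
prediction = [1, 1, 2, 0]
target = [1, 2, 2, 0]
print(mean_dice(prediction, target, classes=[1, 2]))
```

mHD95, the other metric in the abstract, is a boundary-distance measure and is not covered by this sketch.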

Authors:Will LeVine, Bijan Varjavand
Title: Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
Abstract:
Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG), which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference-time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
中文: 现代RAG系统若仅优化上下文相关性会制约回答质量,而REBEL方法通过多标准推理优化,实现了检索相关性和答案质量的双重提升。
English: Modern RAG systems often focus solely on maximizing context relevance, which can limit response quality, but the new REBEL method improves both relevance and answer quality by optimizing multiple criteria during inference.

Authors:Yubin Hong, Chaofan Li, Jingyi Zhang, Yingxia Shao
Title: FG-RAG: Enhancing Query-Focused Summarization with Context-Aware Fine-Grained Graph RAG
Abstract:
Retrieval-Augmented Generation (RAG) enables large language models to provide more precise and pertinent responses by incorporating external knowledge. In the Query-Focused Summarization (QFS) task, GraphRAG-based approaches have notably enhanced the comprehensiveness and diversity of generated responses. However, existing GraphRAG-based approaches predominantly focus on coarse-grained information summarization without being aware of the specific query, and the retrieved content lacks sufficient contextual information to generate comprehensive responses. To address the deficiencies of current RAG systems, we propose Context-Aware Fine-Grained Graph RAG (FG-RAG) to enhance the performance of the QFS task. FG-RAG employs Context-Aware Entity Expansion in graph retrieval to expand the coverage of retrieved entities in the graph, thus providing enough contextual information for the retrieved content. Furthermore, FG-RAG utilizes Query-Level Fine-Grained Summarization to incorporate fine-grained details during response generation, enhancing query awareness for the generated summarization. Our evaluation demonstrates that FG-RAG outperforms other RAG systems in multiple metrics of comprehensiveness, diversity, and empowerment when handling the QFS task. Our implementation is available at https://github.com/BuptWululu/FG-RAG.
中文: FG-RAG通过上下文感知的实体扩展和查询级细粒度摘要,显著提升了查询聚焦摘要任务的全面性、多样性和赋能效果,优于现有RAG系统。
English: FG-RAG enhances Query-Focused Summarization by expanding entity coverage and incorporating fine-grained details, outperforming existing RAG systems in comprehensiveness, diversity, and empowerment.

Authors:Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec
Title: FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
Abstract:
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth
中文: FlashDepth 是一种实时视频深度估计方法,能够在24帧/秒下生成高分辨率、准确且一致的深度图,以较少训练数据显著超越现有模型的速度和边界清晰度,同时保持竞争力精度。
English: FlashDepth is a real-time video depth estimation method that achieves high-resolution, accurate, and consistent depth maps at 24 FPS, outperforming existing models in speed and boundary sharpness while maintaining competitive accuracy with minimal training data.

Authors:Alexander Rubinstein, Ameya Prabhu, Matthias Bethge, Seong Joon Oh
Title: Are We Done with Object-Centric Learning?
Abstract:
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available here: https://github.com/AlexanderRubinstein/OCCAM.
中文: 物体中心学习(OCL)旨在从场景中分离物体以提升泛化能力和效率,提出的无需训练的探针OCCAM表明,基于分割的物体编码显著优于传统基于槽位的方法,但实际应用中的挑战依然存在。
English: Object-centric learning (OCL) aims to isolate objects from scenes for improved generalization and efficiency; the proposed training-free probe OCCAM shows that segmentation-based encoding of individual objects significantly outperforms slot-based approaches, although challenges in real-world applications remain.

Authors:Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
Title: AssistanceZero: Scalably Solving Assistance Games
Abstract:
Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Our code and models are available at https://github.com/cassidylaidlaw/minecraft-building-assistance-game.
中文: 辅助博弈为训练AI助手提供了可扩展且有效的替代方案,通过AssistanceZero在《我的世界》等复杂环境中的卓越表现,减少了用户操作步骤并解决了欺骗性激励等问题。
English: Assistance games offer a scalable and effective alternative to RLHF for training AI assistants, as demonstrated by AssistanceZero's superior performance in complex environments like Minecraft, reducing user actions and addressing issues like deceptive incentives.

Authors:Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Peng Gao, Bo Zhang
Title: OmniCaptioner: One Captioner to Rule Them All
Abstract:
We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
中文: OmniCaptioner 是一个通用的视觉描述框架,可为多种视觉领域生成精细的文本描述,增强大语言模型的视觉推理能力、改进图像生成并实现高效的监督微调。
English: OmniCaptioner is a unified visual captioning framework that generates detailed descriptions for diverse visual domains, enhancing visual reasoning with LLMs, improving image generation, and enabling efficient supervised fine-tuning.

Authors:Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: A Unified Agentic Framework for Evaluating Conditional Image Generation
Abstract:
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.
中文: 本文提出CIGEval框架,利用大型多模态模型全面评估条件图像生成任务,在多项实验中达到接近人工评估的相关性,并超越现有最优方法。
English: This paper presents CIGEval, an agentic framework using large multimodal models to comprehensively evaluate conditional image generation, achieving near-human correlation in assessments and surpassing previous state-of-the-art methods.

Authors:Yuan Xiao, Yuchen Chen, Shiqing Ma, Haocheng Huang, Chunrong Fang, Yanwei Chen, Weisong Sun, Yunfeng Zhu, Xiaofang Zhang, Zhenyu Chen
Title: DeCoMa: Detecting and Purifying Code Dataset Watermarks through Dual Channel Code Abstraction
Abstract:
Watermarking is a technique to help identify the source of data points, which can be used to help prevent the misuse of protected datasets. Existing methods on code watermarking, leveraging the idea from the backdoor research, embed stealthy triggers as watermarks. Despite their high resilience against dilution attacks and backdoor detections, the robustness has not been fully evaluated. To fill this gap, we propose DeCoMa, a dual-channel approach to Detect and purify Code dataset waterMarks. To overcome the high barrier created by the stealthy and hidden nature of code watermarks, DeCoMa leverages dual-channel constraints on code to generalize and map code samples into standardized templates. Subsequently, DeCoMa extracts hidden watermarks by identifying outlier associations between paired elements within the standardized templates. Finally, DeCoMa purifies the watermarked dataset by removing all samples containing the detected watermark, enabling the silent appropriation of protected code. We conduct extensive experiments to evaluate the effectiveness and efficiency of DeCoMa, covering 14 types of code watermarks and 3 representative intelligent code tasks (a total of 14 scenarios). Experimental results demonstrate that DeCoMa achieves a stable recall of 100% in 14 code watermark detection scenarios, significantly outperforming the baselines. Additionally, DeCoMa effectively attacks code watermarks with embedding rates as low as 0.1%, while maintaining comparable model performance after training on the purified dataset. Furthermore, as DeCoMa requires no model training for detection, it achieves substantially higher efficiency than all baselines, with a speedup ranging from 31.5 to 130.9X. The results call for more advanced watermarking techniques for code models, while DeCoMa can serve as a baseline for future evaluation. Code is available at https://github.com/xiaoyuanpigo/DeCoMa
中文: DeCoMa采用双通道方法,通过标准化代码模板和识别异常关联来有效检测并清除代码数据集中的隐藏水印,在14种水印场景中实现100%召回率,同时检测效率显著超越基线方法。
English: DeCoMa is a dual-channel approach that detects and purifies hidden watermarks in code datasets by mapping code samples to standardized templates and identifying outlier associations, achieving 100% recall in 14 code watermark detection scenarios while running 31.5 to 130.9 times faster than the baselines.

Authors:Tomohiro Hayase, Benoît Collins, Nakamasa Inoue
Title: Free Random Projection for In-Context Reinforcement Learning
Abstract:
Hierarchical inductive biases are hypothesized to promote generalizable policies in reinforcement learning, as demonstrated by explicit hyperbolic latent representations and architectures. Therefore, a more flexible approach is to have these biases emerge naturally from the algorithm. We introduce Free Random Projection, an input mapping grounded in free probability theory that constructs random orthogonal matrices where hierarchical structure arises inherently. The free random projection integrates seamlessly into existing in-context reinforcement learning frameworks by encoding hierarchical organization within the input space without requiring explicit architectural modifications. Empirical results on multi-environment benchmarks show that free random projection consistently outperforms the standard random projection, leading to improvements in generalization. Furthermore, analyses within linearly solvable Markov decision processes and investigations of the spectrum of kernel random matrices reveal the theoretical underpinnings of free random projection's enhanced performance, highlighting its capacity for effective adaptation in hierarchically structured state spaces.
中文: 自由随机投影是一种通过随机正交矩阵自然生成层次结构的新方法,无需修改架构即可提升强化学习的泛化能力,其优势已通过实证和理论分析得到验证。
English: Free Random Projection is a novel method that inherently develops hierarchical structures through random orthogonal matrices, enhancing reinforcement learning generalization without architectural changes, as validated by empirical and theoretical analyses.

Authors:Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
Title: Adaptive Computation Pruning for the Forgetting Transformer
Abstract:
The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2-3$\times$ speedup) and a roughly 10% to 40% increase in end-to-end training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
中文: 针对遗忘变换器提出的自适应计算剪枝(ACP)方法通过基于遗忘门衰减动态剪枝可忽略计算,在保持性能不变的同时实现注意力计算2-3倍加速和10-40%训练吞吐量提升。
English: The Adaptive Computation Pruning (ACP) method for the Forgetting Transformer dynamically prunes negligible computations based on forget gate decay, achieving 2-3× attention speedup and 10-40% training throughput gains without performance loss.
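The pruning idea in this entry can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that the forget gate enters attention as an additive bias D[i][j] equal to the sum of log forget-gate values over steps j+1..i, and it uses a fixed, hypothetical threshold, whereas the abstract describes a dynamically set threshold with a safety guarantee. All function names are illustrative and are not taken from the authors' repository.

```python
import math


def forget_bias(log_f):
    """Return D with D[i][j] = sum(log_f[j+1 .. i]) for j <= i (causal bias)."""
    prefix = [0.0]
    for value in log_f:
        prefix.append(prefix[-1] + value)
    n = len(log_f)
    return [[prefix[i + 1] - prefix[j + 1] for j in range(i + 1)] for i in range(n)]


def pruned_attention_weights(scores, log_f, threshold):
    """Causal softmax over scores + D, skipping pairs whose bias is below threshold.

    threshold must be <= 0 so that the diagonal entry (bias 0) is always kept.
    Returns the attention weights and the number of (i, j) pairs computed.
    """
    if threshold > 0:
        raise ValueError("threshold must be <= 0")
    bias = forget_bias(log_f)
    weights, kept = [], 0
    for i, row in enumerate(bias):
        # Only strongly decayed dependencies (bias < threshold) are dropped.
        logits = {j: scores[i][j] + d for j, d in enumerate(row) if d >= threshold}
        kept += len(logits)
        top = max(logits.values())
        norm = sum(math.exp(v - top) for v in logits.values())
        weights.append(
            [math.exp(logits[j] - top) / norm if j in logits else 0.0 for j in range(i + 1)]
        )
    return weights, kept
```

With a head that forgets quickly (forget-gate values near zero), only a narrow band near the diagonal survives, which matches the abstract's observation that such heads rely mainly on local context.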

Authors:Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales
Title: Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition
Abstract:
Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.
中文: 本研究使用新型FoodNExTDB数据库和专家加权召回率指标评估了六种视觉语言模型,发现闭源模型在基础食物识别表现出色,但在烹饪方式等精细区分方面仍存在不足。
English: This study evaluates six Vision-Language Models for food recognition using the novel FoodNExTDB database and Expert-Weighted Recall metric, finding closed-source models excel at basic recognition but struggle with fine-grained distinctions like cooking styles.

Authors:Chang Nie, Yiqing Xu, Guangming Wang, Zhe Liu, Yanzi Miao, Hesheng Wang
Title: MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking
Abstract:
Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
中文: 提出的MovSAM框架利用思维链增强的多模态大语言模型,通过逻辑推理实现单幅图像中的运动物体分割,在公开MOS基准上取得最先进性能(J&F达到92.5%),有效解决了多帧方法失效的应用场景。
English: The proposed MovSAM framework uses a Chain-of-Thought-enhanced Multimodal Large Language Model to enable logic-driven moving object segmentation from single images, reaching 92.5% J&F on public MOS benchmarks and handling scenarios where multi-frame methods fail.

Authors:Yuxin Wang, Yiran Guo, Yining Zheng, Zhangyue Yin, Shuo Chen, Jie Yang, Jiajun Chen, Yuan Li, Xuanjing Huang, Xipeng Qiu
Title: FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Abstract:
The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool, including base and extended datasets, challenges LLMs with queries spanning from 1 to 4 relational hops (e.g., inferring familial connections and preferences) and 2 to 6 hops respectively, and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at https://github.com/yxzwang/FamilyTool.
中文: 本文提出基于家庭知识图谱的新基准FamilyTool,通过个性化多跳推理任务和归纳场景测试大语言模型,揭示了现有模型在处理复杂动态环境时存在的显著性能差距与泛化缺陷。
English: This paper introduces FamilyTool, a novel benchmark based on a family knowledge graph that challenges large language models with personalized, multi-hop reasoning tasks and inductive scenarios, revealing significant performance gaps and generalization issues in current models.
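To make "multi-hop" concrete, the following toy sketch resolves a chain of relations over a small family knowledge graph. The graph contents, relation names, and the `follow` helper are hypothetical illustrations and are not taken from the FamilyTool dataset or the KGETool pipeline.

```python
def follow(kg, start, relations):
    """Resolve a chain of relations, one hop per relation, from a start entity.

    kg maps (head, relation) pairs to lists of tail entities.
    Returns the set of entities reached after the final hop.
    """
    frontier = {start}
    for relation in relations:
        frontier = {
            tail
            for head in frontier
            for tail in kg.get((head, relation), ())
        }
    return frontier


# Hypothetical three-hop query: "What cuisine does Alice's mother's husband like?"
toy_kg = {
    ("alice", "mother"): ["carol"],
    ("carol", "husband"): ["dave"],
    ("dave", "favorite_cuisine"): ["thai"],
}
answer = follow(toy_kg, "alice", ["mother", "husband", "favorite_cuisine"])
```

In the inductive setting described in the abstract, the graph would change at inference time (for example, a new preference edge for `dave`), and the answer has to follow the updated graph without any retraining.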

Authors:Pedro Hermosilla, Christian Stippel, Leon Sick
Title: Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
Abstract:
Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our Github repository (https://github.com/phermosilla/msm).
中文: 自监督学习已革新二维视觉但在三维场景理解中应用有限;本文提出一种鲁棒评估协议和新型掩码场景建模方法,使自监督三维特征无需微调即可达到监督学习性能。
English: Self-supervised learning has revolutionized 2D vision but remains underutilized in 3D scene understanding; this paper introduces both a robust evaluation protocol and a novel Masked Scene Modeling approach that enables self-supervised 3D features to match supervised performance without fine-tuning.

Authors:Alexandre Banks, Richard Cook, Septimiu E. Salcudean
Title: Setup-Invariant Augmented Reality for Teaching by Demonstration with Surgical Robots
Abstract:
Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 mm (SD = 2.01 mm). In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.
中文: dV-STEAR系统通过增强现实技术让新手外科医生能在专家指导下进行训练,无需相同设备配置即可显著提升手术任务表现并降低操作挫败感。
English: The dV-STEAR system enables novice surgeons to train with expert guidance in augmented reality, significantly improving performance and reducing frustration in robotic surgery tasks without requiring identical equipment setups.

Authors:Elia Peruzzo, Dejia Xu, Xingqian Xu, Humphrey Shi, Nicu Sebe
Title: RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
Abstract:
Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.
中文: 本文提出一种检索增强框架,通过在生成过程中引入真实视频作为运动参照来提升视频生成的运动真实感,该方法仅需少量微调即可适配现有模型,并在多项评估中展现优越性能。
English: This paper introduces a retrieval-enhanced framework to improve motion realism in video generation by grounding diffusion models with real video examples, requiring minimal fine-tuning and demonstrating superior performance across metrics and benchmarks.

Authors:Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu
Title: Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Abstract:
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a \textbf{divide-then-aggregate} strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters
中文: 高质量图像描述对于跨模态应用至关重要,本研究提出了一种先划分后聚合的策略,通过语义和空间分块增强细节并减少多模态模型中的幻觉,无需重新训练。
English: High-quality image captions are essential for cross-modal applications, and this study introduces a divide-then-aggregate strategy using semantic and spatial patches to enhance detail and reduce hallucinations in multimodal models without retraining.

Authors:Osama Ahmad, Zubair Khalid
Title: Robust and Noise-resilient Long-Term Prediction of Spatiotemporal Data Using Variational Mode Graph Neural Networks with 3D Attention
Abstract:
This paper focuses on improving the robustness of spatiotemporal long-term prediction using a variational mode graph convolutional network (VMGCN) by introducing 3D channel attention. The deep learning network for this task relies on historical data inputs, yet real-time data can be corrupted by sensor noise, altering its distribution. We model this noise as independent and identically distributed (i.i.d.) Gaussian noise and incorporate it into the LargeST traffic volume dataset, resulting in data with both inherent and additive noise components. Our approach involves decomposing the corrupted signal into modes using variational mode decomposition, followed by feeding the data into a learning pipeline for prediction. We integrate a 3D attention mechanism encompassing spatial, temporal, and channel attention. The spatial and temporal attention modules learn their respective correlations, while the channel attention mechanism is used to suppress noise and highlight the significant modes in the spatiotemporal signals. Additionally, a learnable soft thresholding method is implemented to exclude unimportant modes from the feature vector, and a feature reduction method based on the signal-to-noise ratio (SNR) is applied. We compare the performance of our approach against baseline models, demonstrating that our method achieves superior long-term prediction accuracy, robustness to noise, and improved performance with mode truncation compared to the baseline models. The code of the paper is available at https://github.com/OsamaAhmad369/VMGCN.
中文: 本文通过引入三维通道注意力机制到变分模态图卷积网络中,将含噪信号分解为模态并利用注意力抑制噪声,结合可学习阈值优化特征,从而提升了时空长期预测的鲁棒性和准确性。
English: This paper introduces a 3D channel attention mechanism into a variational mode graph convolutional network (VMGCN) to enhance the robustness and accuracy of spatiotemporal long-term predictions by decomposing noisy signals into modes, applying attention mechanisms to suppress noise, and implementing learnable thresholding for feature optimization.
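Three ingredients named in this entry (additive i.i.d. Gaussian noise, the signal-to-noise ratio, and soft thresholding) can be sketched in a few lines. This is a generic illustration under stated assumptions: the threshold `tau` is a fixed number here, whereas the abstract describes a learnable one, and the standard shrinkage operator is assumed because the abstract does not give the exact form. Function names are hypothetical and do not come from the VMGCN repository.

```python
import math
import random


def add_gaussian_noise(signal, sigma, seed=0):
    """Corrupt each sample with independent zero-mean Gaussian noise of std sigma."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels: 10 * log10(signal power / noise power)."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become zero."""
    return math.copysign(max(abs(x) - tau, 0.0), x)
```

Applied to per-mode coefficients, soft thresholding zeroes out small entries and shrinks the rest, which is one common way to suppress modes dominated by noise.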

Authors:Nan Peng, Xun Zhou, Mingming Wang, Guisong Chen, Wenqi Xu
Title: Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction
Abstract:
Safety constitutes a foundational imperative for autonomous driving systems, necessitating the maximal incorporation of accessible external prior information. This study establishes that temporal perception buffers and cost-efficient maps inherently form complementary prior sources for online vectorized high-definition (HD) map construction. We present Uni-PrevPredMap, a unified prior-informed framework that systematically integrates two synergistic information sources: previous predictions and simulated outdated HD maps. The framework introduces two core innovations: a tile-indexed 3D vectorized global map processor enabling efficient refreshment, storage, and retrieval of 3D vectorized priors; a tri-mode operational optimization paradigm ensuring consistency across non-prior, temporal-prior, and temporal-map-fusion-prior scenarios while mitigating reliance on idealized map fidelity assumptions. Uni-PrevPredMap achieves state-of-the-art performance in map-absent scenarios across established online vectorized HD map construction benchmarks. When provided with simulated outdated HD maps, the framework exhibits robust capabilities in error-resilient prior fusion, empirically confirming the synergistic complementarity between previous predictions and simulated outdated HD maps. Code will be available at https://github.com/pnnnnnnn/Uni-PrevPredMap.
中文:Uni-PrevPredMap框架通过整合时序感知缓存与模拟过时地图,在在线矢量化高精地图构建中实现了最优性能,有效验证了这两种先验信息源之间的互补协同性与错误容忍能力。
English: Uni-PrevPredMap is a unified framework that integrates temporal perception buffers and simulated outdated maps to achieve state-of-the-art performance in online vectorized HD map construction, demonstrating robust error resilience and complementary synergy between these prior information sources.

Authors:Hu Cui, Tessai Hayama
Title: HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network
Abstract:
3D human pose lifting is a promising research area that leverages estimated and ground-truth 2D human pose data for training. While existing approaches primarily aim to enhance the performance of estimated 2D poses, they often struggle when applied to ground-truth 2D pose data. We observe that achieving accurate 3D pose reconstruction from ground-truth 2D poses requires precise modeling of local pose structures, alongside the ability to extract robust global spatio-temporal features. To address these challenges, we propose a novel Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes input data through two parallel streams: Hyper-GCN and Shuffle-Mamba. The Hyper-GCN stream models the human body structure as hypergraphs with varying levels of granularity to effectively capture local joint dependencies. Meanwhile, the Shuffle Mamba stream leverages a state space model to perform spatio-temporal scanning across all joints, enabling the establishment of global dependencies. By adaptively fusing these two representations, HGMamba achieves strong global feature modeling while excelling at local structure modeling. We stack multiple HGMamba blocks to create three variants of our model, allowing users to select the most suitable configuration based on the desired speed-accuracy trade-off. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate the effectiveness of our approach. HGMamba-B achieves state-of-the-art results, with P1 errors of 38.65 mm and 14.33 mm on the respective datasets. Code and models are available: https://github.com/HuCui2022/HGMamba
中文: 提出的HGMamba模型通过并行超图卷积和混洗曼巴流分别捕捉局部关节依赖和全局时空特征,在基准数据集上实现了最先进的三维姿态重建精度。
English: The proposed HGMamba model introduces parallel Hyper-GCN and Shuffle Mamba streams to capture local joint dependencies and global spatio-temporal features, achieving state-of-the-art 3D pose reconstruction accuracy on benchmark datasets.
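For readers unfamiliar with the reported metric, the sketch below computes the mean per-joint position error (MPJPE). It assumes that "P1 error" refers to the common Protocol 1 convention on Human3.6M, which is MPJPE in millimetres; the abstract does not spell this out, and any root alignment applied before evaluation is omitted here.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between matching 3D joints, in the input unit."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)
```

For example, if one of two joints is exact and the other is off by 5 mm, the MPJPE of that pose is 2.5 mm.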

Authors:Ludvig Dillén, Per-Erik Forssén, Johan Edstedt
Title: FACT: Multinomial Misalignment Classification for Point Cloud Registration
Abstract:
We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. This is useful e.g. for quality assurance of large, automatically registered 3D models. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class. We generalize prior work that studies binary alignment classification of registration errors by recasting it as multinomial misalignment classification. To achieve this, we introduce a custom regression-by-classification loss function that combines the cross-entropy and Wasserstein losses, and demonstrate that it outperforms both direct regression and prior binary classification. FACT successfully classifies point-cloud pairs registered with both the classical ICP and GeoTransformer, while other choices, such as standard point-cloud-quality metrics and registration residuals, are shown to be poor choices for predicting misalignment. On a synthetically perturbed point-cloud task introduced by the CorAl method, we show that FACT achieves substantially better performance than CorAl. Finally, we demonstrate how FACT can assist experts in correcting misaligned point-cloud maps. Our code is available at https://github.com/LudvigDillen/FACT_for_PCMC.
中文: FACT是一种通过点变换网络和定制损失函数预测激光雷达点云对配准质量的方法,在分类配准误差方面优于现有方法。
English: FACT is a method that predicts the alignment quality of registered lidar point cloud pairs using a point transformer-based network and a custom loss function, outperforming existing approaches in classifying misalignment.
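
Illustrative sketch (not taken from the FACT repository): the abstract describes a loss that combines cross-entropy with a Wasserstein term over ordered misalignment classes. One common way to realize such a combination is shown below. For ordered classes with unit spacing, the 1D Wasserstein-1 distance between the predicted distribution and a one-hot target equals the sum of absolute CDF differences. The weighting factor `lam` and all function names are assumptions made for illustration; the exact form used in the paper may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target])


def wasserstein_1d(probs, target):
    """W1 distance between `probs` and a one-hot target over ordered,
    unit-spaced classes: the sum of absolute CDF differences."""
    cdf_pred = 0.0
    dist = 0.0
    for k, p in enumerate(probs):
        cdf_pred += p
        cdf_target = 1.0 if k >= target else 0.0
        dist += abs(cdf_pred - cdf_target)
    return dist


def combined_loss(logits, target, lam=1.0):
    """Cross-entropy plus a weighted Wasserstein term.
    `lam` is a hypothetical weighting factor."""
    probs = softmax(logits)
    return cross_entropy(probs, target) + lam * wasserstein_1d(probs, target)


# With the true class at index 2, predicting class 0 (two bins away)
# is penalized more than predicting class 1 (one bin away).
far_miss = combined_loss([5.0, 0.0, 0.0], target=2)
near_miss = combined_loss([0.0, 5.0, 0.0], target=2)
print(far_miss > near_miss)
```

Plain cross-entropy assigns both mistakes essentially the same loss here, since the probability on the true class is the same in both cases. The Wasserstein term is what makes the loss sensitive to how far the prediction lies from the true class.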

Authors:Sujay Khandagale, Bhawna Juneja, Prabhat Agarwal, Aditya Subramanian, Jaewon Yang, Yuting Wang
Title: InteractRank: Personalized Web-Scale Search Pre-Ranking with Cross Interaction Features
Abstract:
Modern search systems use a multi-stage architecture to deliver personalized results efficiently. Key stages include retrieval, pre-ranking, full ranking, and blending, which refine billions of items to top selections. The pre-ranking stage, vital for scoring and filtering hundreds of thousands of items down to a few thousand, typically relies on two tower models due to their computational efficiency, despite often lacking in capturing complex interactions. While query-item cross interaction features are paramount for full ranking, integrating them into pre-ranking models presents efficiency-related challenges. In this paper, we introduce InteractRank, a novel two tower pre-ranking model with robust cross interaction features used at Pinterest. By incorporating historical user engagement-based query-item interactions in the scoring function along with the two tower dot product, InteractRank significantly boosts pre-ranking performance with minimal latency and computation costs. In real-world A/B experiments at Pinterest, InteractRank improves the online engagement metric by 6.5% over a BM25 baseline and by 3.7% over a vanilla two tower baseline. We also highlight other components of InteractRank, like real-time user-sequence modeling, and analyze their contributions through offline ablation studies. The code for InteractRank is available at https://github.com/pinterest/atg-research/tree/main/InteractRank.
Chinese: InteractRank提出了一种新颖的双塔预排序模型,通过整合强大的交叉交互特征,在保持低延迟和计算成本的同时显著提升了性能,在Pinterest的A/B实验中,在线互动指标比BM25基线提高了6.5%,比普通双塔基线提高了3.7%。
English: InteractRank introduces a novel two-tower pre-ranking model that integrates robust cross interaction features, significantly enhancing performance with minimal latency and computation costs; in A/B experiments at Pinterest it improves online engagement by 6.5% over a BM25 baseline and by 3.7% over a vanilla two-tower baseline.
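
Illustrative sketch (not taken from the InteractRank repository): the abstract says the scoring function adds historical engagement-based query-item interaction features to the two tower dot product. A minimal version of that idea is shown below. The `log1p` transform, the `weight` parameter, the engagement lookup table, and all names are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pre_rank_score(query_emb, item_emb, engagement_count, weight=0.5):
    """Two-tower dot product plus a weighted cross interaction feature.
    The feature here is a log-scaled historical engagement count
    (a hypothetical choice)."""
    cross_feature = math.log1p(engagement_count)
    return dot(query_emb, item_emb) + weight * cross_feature


def pre_rank(query, query_emb, items, engagement, top_k):
    """Score every candidate item and keep the top_k item ids."""
    scored = []
    for item_id, item_emb in items.items():
        count = engagement.get((query, item_id), 0)
        scored.append((pre_rank_score(query_emb, item_emb, count), item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


items = {"a": [1.0, 0.0], "b": [0.9, 0.0]}
# Item "b" has a slightly lower embedding match but a strong engagement history.
print(pre_rank("q", [1.0, 0.0], items, {("q", "b"): 100}, top_k=2))
```

Because the item tower output and the engagement lookup can both be precomputed, the extra cost at scoring time is one table lookup and one addition per candidate, in keeping with the abstract's emphasis on minimal latency.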

Authors:Junrui Zhang, Chenjie Wang, Jie Peng, Haoyu Li, Jianmin Ji, Yu Zhang, Yanyong Zhang
Title: CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving
Abstract:
Imitation learning based planning tasks on the nuPlan dataset have gained great interest due to their potential to generate human-like driving behaviors. However, open-loop training on the nuPlan dataset tends to cause causal confusion during closed-loop testing, and the dataset also presents a long-tail distribution of scenarios. These issues introduce challenges for imitation learning. To tackle these problems, we introduce CAFE-AD, a Cross-Scenario Adaptive Feature Enhancement method for trajectory planning in autonomous driving, designed to enhance feature representation across various scenario types. We develop an adaptive feature pruning module that ranks feature importance to capture the most relevant information while reducing the interference of noisy information during training. Moreover, we propose a cross-scenario feature interpolation module that enhances scenario information to introduce diversity, enabling the network to alleviate over-fitting in dominant scenarios. We evaluate CAFE-AD on the challenging public nuPlan Test14-Hard closed-loop simulation benchmark. The results demonstrate that CAFE-AD outperforms state-of-the-art methods, including rule-based and hybrid planners, and shows potential for mitigating the impact of the long-tail distribution within the dataset. Additionally, we validate its effectiveness in real-world environments. The code and models will be made available at https://github.com/AlniyatRui/CAFE-AD.
Chinese: CAFE-AD方法通过自适应特征剪枝和跨场景插值模块,有效解决了nuPlan模仿学习中的因果混淆和长尾分布问题,在基准测试和实际验证中均展现出优于现有方法的性能。
English: The proposed CAFE-AD method addresses causal confusion and long-tail distribution challenges in nuPlan-based imitation learning by introducing adaptive feature pruning and cross-scenario interpolation modules, demonstrating superior performance in both benchmark tests and real-world validation.

Authors:Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, Shiyu Chang
Title: Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning
Abstract:
Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, the security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, for instance transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list, contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
中文: 针对LLM水印技术面临的安全挑战,特别是欺骗攻击可在保留水印的同时恶意篡改文本含义,本研究提出一种语义感知水印算法,通过后处理嵌入水印,在保持高检测率的同时有效抵御移除攻击和欺骗攻击。
English: Watermarking for LLMs faces security challenges from spoofing attacks, which can maliciously alter text meaning while preserving watermarks, and this study proposes a semantic-aware algorithm that embeds watermarks post-hoc to ensure robustness against removal and security against spoofing, maintaining high detectability.

Authors:Minshuo Chen, Renyuan Xu, Yumin Xu, Ruixun Zhang
Title: Diffusion Factor Models: Generating High-Dimensional Returns with Factor Structure
Abstract:
Financial scenario simulation is essential for risk management and portfolio optimization, yet it remains challenging especially in high-dimensional and small data settings common in finance. We propose a diffusion factor model that integrates latent factor structure into generative diffusion processes, bridging econometrics with modern generative AI to address the challenges of the curse of dimensionality and data scarcity in financial simulation. By exploiting the low-dimensional factor structure inherent in asset returns, we decompose the score function (a key component in diffusion models) using time-varying orthogonal projections, and this decomposition is incorporated into the design of neural network architectures. We derive rigorous statistical guarantees, establishing nonasymptotic error bounds for both score estimation at O(d^{5/2} n^{-2/(k+5)}) and generated distribution at O(d^{5/4} n^{-1/2(k+5)}), primarily driven by the intrinsic factor dimension k rather than the number of assets d, surpassing the dimension-dependent limits in the classical nonparametric statistics literature and making the framework viable for markets with thousands of assets. Numerical studies confirm superior performance in latent subspace recovery under small data regimes. Empirical analysis demonstrates the economic significance of our framework in constructing mean-variance optimal portfolios and factor portfolios. This work presents the first theoretical integration of factor structure with diffusion models, offering a principled approach for high-dimensional financial simulation with limited data. Our code is available at https://github.com/xymmmm00/diffusion_factor_model.
中文摘要:本文提出扩散因子模型,将潜在因子结构与生成扩散过程相结合,有效解决金融模拟中的高维数据稀缺难题,并通过理论证明和实证分析展示了其在投资组合优化中的显著优势。
English Summary: The paper introduces a diffusion factor model that combines latent factor structures with generative diffusion processes to overcome dimensionality and data scarcity challenges in financial simulations, providing theoretical guarantees and demonstrating effectiveness in portfolio optimization.

Authors:Xiaohang Yang, Qing Wang, Jiahao Yang, Gregory Slabaugh, Shanxin Yuan
Title: STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
Abstract:
Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code page: https://github.com/XiaohangYang829/STaR.
中文摘要:本文提出STaR模型,通过空间模块的密集形状表征与穿透约束保障几何合理性,结合时序模块的变换器与一致性约束实现运动平滑,在保持运动语义的同时显著减少了穿透现象。
English Summary: The paper introduces STaR, a sequence-to-sequence model that integrates spatial and temporal modules with penetration and consistency constraints to achieve balanced motion retargeting by preserving semantics while ensuring geometric plausibility and temporal smoothness.

Authors:Halid Abdulrahim Kadi, Kasim Terzić
Title: Agent-Arena: A General Framework for Evaluating Control Algorithms
Abstract:
Robotic research is inherently challenging, requiring expertise in diverse environments and control algorithms. Adapting algorithms to new environments often poses significant difficulties, compounded by the need for extensive hyper-parameter tuning in data-driven methods. To address these challenges, we present Agent-Arena, a Python framework designed to streamline the integration, replication, development, and testing of decision-making policies across a wide range of benchmark environments. Unlike existing frameworks, Agent-Arena is uniquely generalised to support all types of control algorithms and is adaptable to both simulation and real-robot scenarios. Please see our GitHub repository https://github.com/halid1020/agent-arena-v0.
中文: Agent-Arena 是一个通用的 Python 框架,旨在简化决策策略在各种基准环境中的集成、复制、开发和测试,有效应对机器人研究中算法适应性和参数调整等难题。
English: Agent-Arena is a versatile Python framework that facilitates the integration, replication, development, and testing of decision-making policies across diverse benchmark environments, overcoming challenges in robotic research such as algorithm adaptation and hyper-parameter tuning.

Authors:Adam McArthur, Stephanie Wichuk, Stephen Burnside, Andrew Kirby, Alexander Scammon, Damian Sol, Abhilash Hareendranathan, Jacob L. Jaremko
Title: Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI
Abstract:
Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: https://github.com/radoss-org/retuve
中文:Retuve是一个开源框架,旨在解决髋关节发育不良的诊断难题,通过提供标准化的多模态分析工具、开放数据集和预训练模型,促进筛查准确性和早期干预,改善患者预后。
English: Retuve is an open-source framework designed to overcome diagnostic challenges in developmental dysplasia of the hip by providing standardized, multi-modality analysis tools with accessible datasets and models to enhance screening accuracy and early intervention.

Authors:Ildi Alla, Selma Yahia, Valeria Loscri
Title: TRIDENT: Tri-modal Real-time Intrusion Detection Engine for New Targets
Abstract:
The increasing availability of drones and their potential for malicious activities pose significant privacy and security risks, necessitating fast and reliable detection in real-world environments. However, existing drone detection systems often struggle in real-world settings due to environmental noise and sensor limitations. This paper introduces TRIDENT, a tri-modal drone detection framework that integrates synchronized audio, visual, and RF data to enhance robustness and reduce dependence on individual sensors. TRIDENT introduces two fusion strategies - Late Fusion and GMU Fusion - to improve multi-modal integration while maintaining efficiency. The framework incorporates domain-specific feature extraction techniques alongside a specialized data augmentation pipeline that simulates real-world sensor degradation to improve generalization capabilities. A diverse multi-sensor dataset is collected in urban and non-urban environments under varying lighting conditions, ensuring comprehensive evaluation. Experimental results show that TRIDENT achieves 98.8 percent accuracy in real-world recordings and 83.26 percent in a more complex setting (augmented data), outperforming unimodal and dual-modal baselines. Moreover, TRIDENT operates in real-time, detecting drones in just 6.09 ms while consuming only 75.27 mJ per detection, making it highly efficient for resource-constrained devices. The dataset and code have been released to ensure reproducibility (https://github.com/TRIDENT-2025/TRIDENT).
中文: 本文提出TRIDENT三模态无人机检测框架,通过融合音频、视觉和射频数据,在多种环境中实现高精度实时检测,性能优于现有方法。
English: This paper presents TRIDENT, a tri-modal drone detection framework that integrates audio, visual, and RF data to achieve high accuracy and real-time efficiency in diverse environments, outperforming existing methods.

Authors:Huzaifa Arif, Keerthiram Murugesan, Payel Das, Alex Gittens, Pin-Yu Chen
Title: PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks
Abstract:
This paper explores inference-time data leakage risks of deep neural networks (NNs), where a curious and honest model service provider is interested in retrieving users' private data inputs solely based on the model inference results. Particularly, we revisit residual NNs due to their popularity in computer vision and our hypothesis that residual blocks are a primary cause of data leakage owing to the use of skip connections. By formulating inference-time data leakage as a constrained optimization problem, we propose a novel backward feature inversion method, PEEL, which can effectively recover block-wise input features from the intermediate output of residual NNs. The surprising results in high-quality input data recovery can be explained by the intuition that the output from these residual blocks can be considered as a noisy version of the input and thus the output retains sufficient information for input recovery. We demonstrate the effectiveness of our layer-by-layer feature inversion method on facial image datasets and pre-trained classifiers. Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE). The code is available at https://github.com/Huzaifa-Arif/PEEL.
中文: 本文提出PEEL这一新型逆向特征还原方法,通过将残差神经网络中间输出视为输入的噪声版本,有效实现输入数据恢复,在人脸图像恢复任务中展现出比现有技术更优越的性能。
English: This paper introduces PEEL, a novel backward feature inversion method that effectively recovers input data from residual neural networks by treating intermediate outputs as noisy versions of inputs, demonstrating superior performance over existing techniques in facial image recovery.

Authors:Jonas Torzewski
Title: Physical spline for denoising object trajectory data by combining splines, ML feature regression and model knowledge
Abstract:
This article presents a method for estimating the dynamic driving states (position, velocity, acceleration and heading) from noisy measurement data. The proposed approach is effective with both complete and partial observations, producing refined trajectory signals with kinematic consistency, ensuring that velocity is the integral of acceleration and position is the integral of velocity. Additionally, the method accounts for the constraint that vehicles can only move in the direction of their orientation. The method is implemented as a configurable Python library that also enables trajectory estimation solely based on position data. Regularization is applied to prevent extreme state variations. A key application is enhancing recorded trajectory data for use as reference inputs in machine learning models. Finally, the article presents the results of the method along with a comparison to ground truth data.
中文: 本文提出一种基于Python的动态驾驶状态估计方法,能从含噪声数据中推算车辆运动轨迹,确保运动学一致性并支持仅凭位置数据优化轨迹,适用于机器学习模型的数据增强。
English: This article introduces a Python-based method for estimating dynamic driving states from noisy data, ensuring kinematic consistency and orientation constraints while enabling trajectory refinement for machine learning applications.
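
Illustrative sketch (not the article's library, whose API is not described in the abstract): the kinematic consistency constraints mentioned above can be made concrete by generating every state from acceleration and yaw rate alone. Velocity is then the integral of acceleration, position is the integral of velocity, and motion always follows the heading. The forward Euler integration and all names below are assumptions made for illustration; the article itself combines splines with regression.

```python
import math


def integrate_states(accels, yaw_rates, dt, x0=0.0, y0=0.0, v0=0.0, heading0=0.0):
    """Forward-Euler integration of a simple kinematic model.

    Returns a list of (x, y, v, heading) tuples, one per time step,
    starting with the initial state."""
    x, y, v, heading = x0, y0, v0, heading0
    states = [(x, y, v, heading)]
    for a, w in zip(accels, yaw_rates):
        # Position advances only along the current heading.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        # Velocity integrates acceleration; heading integrates yaw rate.
        v += a * dt
        heading += w * dt
        states.append((x, y, v, heading))
    return states


# Constant 1 m/s^2 acceleration, straight ahead, for four 1-second steps.
states = integrate_states([1.0] * 4, [0.0] * 4, dt=1.0)
print(states[-1])  # (6.0, 0.0, 4.0, 0.0)
```

A denoising method built on this kind of parameterization only has to estimate the acceleration and yaw-rate signals. The remaining states follow by integration, so they cannot contradict one another.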

Authors:Zixuan Yi, Yao Tian, Zachary G. Ives, Ryan Marcus
Title: Low Rank Learning for Offline Query Optimization
Abstract:
Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce LimeQO, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing "hot" path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS. The code is available at https://github.com/zixy17/LimeQO.
中文: LimeQO提出了一种低秩学习框架,通过线性方法离线优化查询,高效预测查询计划延迟并减少工作量探索时间,提供低开销且无性能回退的保障。
English: LimeQO introduces a low-rank learning framework for offline query optimization that uses linear methods to efficiently predict query plan latencies and reduce workload exploration time, offering a low-overhead solution with a no-regressions guarantee.
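
Illustrative sketch (not taken from the LimeQO repository): the abstract models the workload as a partially observed, low-rank matrix and predicts unobserved plan latencies with linear methods. The example below shows that idea in its simplest form, a rank-1 completion fitted by alternating least squares, where rows stand for queries and columns for alternative plans. The rank-1 restriction, the use of `None` for unobserved entries, and all names are assumptions made for illustration.

```python
def complete_rank1(matrix, iters=100):
    """Fill in unobserved entries (None) of a latency matrix by fitting
    matrix[i][j] ~ u[i] * v[j] with alternating least squares."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Fix v and solve a one-dimensional least-squares problem per row.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            if den > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / den
        # Fix u and solve a one-dimensional least-squares problem per column.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            if den > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / den
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Observed latencies in seconds; None marks plans that were never executed.
observed = [
    [2.0, 4.0, None],
    [4.0, 8.0, 2.0],
    [6.0, None, 3.0],
]
predicted = complete_rank1(observed)
print(round(predicted[0][2], 3), round(predicted[2][1], 3))
```

Each update is a closed-form least-squares solve with no neural network involved. An offline explorer could use predictions like these to decide which untried plan to execute next.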

Authors:Hritam Basak, Zhaozheng Yin
Title: SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Abstract:
Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding the exploration of the limited information about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available at https://github.com/hritam-98/SemiDAViL.
中文: 本文首次提出语言引导的半监督领域自适应方法用于语义分割,通过利用视觉语言模型增强类别区分能力并采用新型损失函数解决数据不平衡问题,显著超越了现有最优方法。
English: This paper introduces the first language-guided semi-supervised domain adaptation (SSDA) method for semantic segmentation, leveraging vision-language models to enhance class discrimination and address data imbalance through novel loss formulations, achieving state-of-the-art performance.

Authors:Hicham Talaoubrid, Anissa Mokraoui, Ismail Ben Ayed, Axel Prouvost, Sonimith Hang, Monit Korn, Rémi Harvey
Title: Analyzing the Impact of Low-Rank Adaptation for Cross-Domain Few-Shot Object Detection in Aerial Images
Abstract:
This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results show that LoRA applied after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. These findings highlight LoRA's potential for efficient adaptation in aerial object detection, encouraging further research into parameter-efficient fine-tuning strategies for few-shot learning. Our code is available here: https://github.com/HichTala/LoRA-DiffusionDet.
中文: 本研究将低秩自适应(LoRA)应用于小模型进行少样本航空目标检测,表明其在少样本场景下能轻微提升性能,而完整微调在数据充足时效果更佳。
English: This study applies Low-Rank Adaptation (LoRA) to small models for few-shot aerial object detection, showing it slightly improves performance in low-shot scenarios while full fine-tuning works better with more data.
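
Illustrative sketch (not taken from the LoRA-DiffusionDet repository): Low-Rank Adaptation keeps a pretrained weight matrix W frozen and learns a low-rank update, so the adapted layer computes W x + (alpha / r) * B A x, where A has r rows and B has r columns. The pure-Python class below shows the mechanism on a single linear layer. The class name, initialization scale, and seed are assumptions made for illustration.

```python
import random


def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        # A starts with small random values; B starts at zero, so the
        # adapted layer initially reproduces the pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Number of entries in A and B, the only trainable parameters."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # [1.0, 2.0, 3.0, 4.0]
print(layer.num_trainable())  # 8
```

For a d_out by d_in layer the update has r * (d_in + d_out) trainable entries instead of d_in * d_out. This reduced capacity is consistent with the abstract's description of LoRA as a way to mitigate overfitting.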

Authors:Bailey J. Eccles, Leon Wong, Blesson Varghese
Title: Mosaic: Composite Projection Pruning for Resource-efficient LLMs
Abstract:
Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use are observed for Mosaic models. Mosaic is available for public use from https://github.com/blessonvar/Mosaic
中文摘要:本文提出Mosaic系统,采用复合投影剪枝技术高效压缩大语言模型,相比现有方法,在模型生成速度、准确性和资源消耗方面均实现显著优化。
English Summary: This paper introduces Mosaic, a system utilizing composite projection pruning to efficiently compress large language models, achieving faster model generation, improved accuracy, and reduced resource consumption compared to existing methods.

Authors:Ziwei Yang, Takeyuki Tamura
Title: DeepGDel: Deep Learning-based Gene Deletion Prediction Framework for Growth-Coupled Production in Genome-Scale Metabolic Models
Abstract:
In genome-scale constraint-based metabolic models, gene deletion strategies are crucial for achieving growth-coupled production, where cell growth and target metabolite production are simultaneously achieved. While computational methods for calculating gene deletions have been widely explored and contribute to developing gene deletion strategy databases, current approaches are limited in leveraging new data-driven paradigms, such as machine learning, for more efficient strain design. Therefore, it is necessary to propose a fundamental framework for this objective. In this study, we first formulate the problem of gene deletion strategy prediction and then propose a framework for predicting gene deletion strategies for growth-coupled production in genome-scale metabolic models. The proposed framework leverages deep learning algorithms to learn and integrate sequential gene and metabolite data representation, enabling the automatic gene deletion strategy prediction. Computational experiment results demonstrate the feasibility of the proposed framework, showing substantial improvements over baseline methods. Specifically, the proposed framework achieves a 14.69%, 22.52%, and 13.03% increase in overall accuracy across three metabolic models of different scales under study, while maintaining balanced precision and recall in predicting gene deletion statuses. The source code and examples for the framework are publicly available at https://github.com/MetNetComp/DeepGDel.
中文: 本研究提出一个深度学习框架,用于预测代谢模型中生长偶联生产的基因敲除策略,在多个模型规模上相比基线方法展现出显著提升的预测准确性。
English: This study introduces a deep learning framework that predicts gene deletion strategies for growth-coupled production in metabolic models, demonstrating improved accuracy over baseline methods across multiple model scales.

Authors:Mohsen Jenadeleh, Jon Sneyers, Panqi Jia, Shima Mohammadi, Joao Ascenso, Dietmar Saupe
Title: Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression
Abstract:
Learning-based image compression methods have recently emerged as promising alternatives to traditional codecs, offering improved rate-distortion performance and perceptual quality. JPEG AI represents the latest standardized framework in this domain, leveraging deep neural networks for high-fidelity image reconstruction. In this study, we present a comprehensive subjective visual quality assessment of JPEG AI-compressed images using the JPEG AIC-3 methodology, which quantifies perceptual differences in terms of Just Noticeable Difference (JND) units. We generated a dataset of 50 compressed images with fine-grained distortion levels from five diverse sources. A large-scale crowdsourced experiment collected 96,200 triplet responses from 459 participants. We reconstructed JND-based quality scales using a unified model based on boosted and plain triplet comparisons. Additionally, we evaluated the alignment of multiple objective image quality metrics with human perception in the high-fidelity range. The CVVDP metric achieved the overall highest performance; however, most metrics including CVVDP were overly optimistic in predicting the quality of JPEG AI-compressed images. These findings emphasize the necessity for rigorous subjective evaluations in the development and benchmarking of modern image codecs, particularly in the high-fidelity range. Another technical contribution is the introduction of the well-known Meng-Rosenthal-Rubin statistical test to the field of Quality of Experience research. This test can reliably assess the significance of difference in performance of quality metrics in terms of correlation between metrics and ground truth. The complete dataset, including all subjective scores, is publicly available at https://github.com/jpeg-aic/dataset-JPEG-AI-SDR25.
中文: 本研究对JPEG AI压缩图像进行了全面的主观质量评估,发现多数客观指标高估了其感知质量,并强调了在编解码器开发中严格人类评估的必要性。
English: This study conducts a comprehensive subjective quality assessment of JPEG AI-compressed images, revealing that most objective metrics overestimate their perceptual quality and underscoring the need for rigorous human evaluation in codec development.
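
Note: The Meng-Rosenthal-Rubin test mentioned in the abstract compares two correlations that share a common variable (here, two metrics each correlated with the same ground truth). The following is a minimal sketch of the standard formula, not the authors' code; the function name and the example values are hypothetical.

```python
import math

def meng_rosenthal_rubin(r1, r2, r12, n):
    """Two-sided z-test for H0: corr(metric1, truth) == corr(metric2, truth).

    r1, r2: correlation of each metric with the ground truth
    r12:    correlation between the two metrics
    n:      number of stimuli
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transform
    r_sq_mean = (r1 * r1 + r2 * r2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - r_sq_mean)), 1.0)  # f is capped at 1
    h = (1.0 - f * r_sq_mean) / (1.0 - r_sq_mean)
    z = (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal p-value
    return z, p

# Hypothetical example values, not results from the paper.
z, p = meng_rosenthal_rubin(r1=0.9, r2=0.8, r12=0.85, n=50)
```

A small p-value suggests that the two metrics differ in how well they correlate with the subjective scores.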

Authors:Hongbin Liang, Hezhe Qiao, Wei Huang, Qizhou Wang, Mingsheng Shang, Lin Chen
Title: Temporal-contextual Event Learning for Pedestrian Crossing Intent Prediction
Abstract:
Ensuring the safety of vulnerable road users through accurate prediction of pedestrian crossing intention (PCI) plays a crucial role in the context of autonomous and assisted driving. Most PCI prediction methods analyze a set of observed ego-view video frames to forecast crossing intent. However, these methods struggle to capture the critical events related to pedestrian behaviour along the temporal dimension due to the high redundancy of the video frames, which results in sub-optimal PCI prediction performance. Our research addresses this challenge by introducing a novel approach called Temporal-contextual Event Learning (TCL). TCL is composed of the Temporal Merging Module (TMM), which manages the redundancy by clustering the observed video frames into multiple key temporal events. Then, the Contextual Attention Block (CAB) is employed to adaptively aggregate multiple event features along with visual and non-visual data. By combining temporal feature extraction with contextual attention on the key information across the critical events, TCL can learn expressive representations for PCI prediction. Extensive experiments are carried out on three widely adopted datasets, including PIE, JAAD-beh, and JAAD-all. The results show that TCL substantially surpasses the state-of-the-art methods. Our code can be accessed at https://github.com/dadaguailhb/TCL.
Chinese: 本研究提出了一种新颖的时序上下文事件学习(TCL)方法,通过将视频帧聚类为关键时序事件并自适应整合上下文特征,显著提升了行人过街意图预测性能,在多个数据集上大幅超越现有最优方法。
English: This study introduces a novel Temporal-contextual Event Learning (TCL) approach that enhances pedestrian crossing intention prediction by clustering video frames into key temporal events and adaptively integrating contextual features, significantly outperforming existing methods across multiple datasets.

Authors:Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, Peter Norgaard
Title: FEABench: Evaluating Language Models on Multiphysics Reasoning Ability
Abstract:
Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench.
中文: FEABench是一个基准测试,旨在评估大型语言模型及其代理通过有限元分析模拟和解决物理、数学及工程问题的能力,以推动人工智能与数值求解器结合,提升工程领域的自动化水平。
English: FEABench is a benchmark designed to assess how well large language models and their agents can simulate and solve physics, math, and engineering problems using finite element analysis, with the goal of enhancing automation in these fields by integrating AI with numerical solvers.

Authors:Krithi Shailya, Shreya Rajpal, Gokul S Krishnan, Balaraman Ravindran
Title: LExT: Towards Evaluating Trustworthiness of Natural Language Explanations
Abstract:
As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, several approaches have been proposed for generating natural language explanations. These explanations are crucial for enhancing the interpretability of a model, especially in sensitive domains like healthcare, where transparency and reliability are key. Because such explanations are generated by LLMs, which have known concerns of their own, there is a growing need for robust evaluation frameworks to assess model-generated explanations. Natural Language Generation metrics like BLEU and ROUGE capture syntactic and semantic accuracies but overlook other crucial aspects such as factual accuracy, consistency, and faithfulness. To address this gap, we propose a general framework for quantifying trustworthiness of natural language explanations, balancing Plausibility and Faithfulness, to derive a comprehensive Language Explanation Trustworthiness Score (LExT) (The code and set up to reproduce our experiments are publicly available at https://github.com/cerai-iitm/LExT). Applying our domain-agnostic framework to the healthcare domain using public medical datasets, we evaluate six models, including domain-specific and general-purpose models. Our findings demonstrate significant differences in their ability to generate trustworthy explanations. On comparing these explanations, we make interesting observations such as inconsistencies in Faithfulness demonstrated by general-purpose models and their tendency to outperform domain-specific fine-tuned models. This work further highlights the importance of using a tailored evaluation framework to assess natural language explanations in sensitive fields, providing a foundation for improving the trustworthiness and transparency of language models in healthcare and beyond.
中文: 本文提出了一个领域无关的LExT框架,通过平衡合理性与忠实性来评估自然语言解释的可信度,弥补了现有指标的不足,并在医疗领域应用中揭示了不同模型生成可信解释能力的显著差异。
English: This paper introduces a domain-agnostic framework called LExT to evaluate the trustworthiness of natural language explanations by balancing plausibility and faithfulness, addressing limitations of existing metrics and demonstrating its application in healthcare with significant model performance variations.

Authors:Xiaoxing Hu, Ziyang Gong, Yupei Wang, Yuru Jia, Gen Luo, Xue Yang
Title: Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed to overcome RS artifacts. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs' performance on RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase Earth-Adapter's effectiveness. Compared with the baseline Rein, Earth-Adapter improves mIoU by 9.0% on DA and by 3.1% on DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.
中文: Earth-Adapter是一种专为遥感设计的参数高效微调方法,通过混合频率适配技术有效分离并克服图像伪影,在领域适应和泛化基准测试中显著优于现有方法。
English: Earth-Adapter is a novel Parameter-Efficient Fine-Tuning method designed specifically for remote sensing, using a Mixture of Frequency Adaptation to effectively separate and overcome artifacts, significantly outperforming previous methods in domain adaptation and generalization benchmarks.
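
Note: The abstract describes using the DFT to decompose features into frequency components. The sketch below only illustrates that idea on a 1D signal with a hard low/high split; the function names and the `cutoff` parameter are assumptions, and Earth-Adapter itself operates on network feature maps with learned adapters.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def split_frequencies(x, cutoff):
    """Split a real signal into low- and high-frequency parts that sum to x."""
    n = len(x)
    X = dft(x)
    # min(k, n - k) treats bin k and its mirror n - k as the same frequency,
    # so the mask is symmetric and both parts stay real-valued.
    low = [X[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [X[k] - low[k] for k in range(n)]
    return [v.real for v in idft(low)], [v.real for v in idft(high)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low_part, high_part = split_frequencies(signal, cutoff=1)
```

Each part could then be processed separately and recombined, which is the role the abstract assigns to the adapter experts.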

Authors:Qing Xu, Zhenye Lou, Chenxin Li, Yue Li, Xiangjian He, Tesema Fiseha Berhanu, Rong Qu, Wenting Duan, Zhen Chen
Title: HER-Seg: Holistically Efficient Segmentation for High-Resolution Medical Images
Abstract:
High-resolution segmentation is critical for precise disease diagnosis by extracting fine-grained morphological details. Existing hierarchical encoder-decoder frameworks have demonstrated remarkable adaptability across diverse medical segmentation tasks. While beneficial, they usually incur huge computation and memory costs when handling large-size segmentation, which limits their applications in foundation model building and real-world clinical scenarios. To address this limitation, we propose a holistically efficient framework for high-resolution medical image segmentation, called HER-Seg. Specifically, we first devise a computation-efficient image encoder (CE-Encoder) to model long-range dependencies with linear complexity while maintaining sufficient representations. In particular, we introduce the dual-gated linear attention (DLA) mechanism to perform cascaded token filtering, selectively retaining important tokens while ignoring irrelevant ones to enhance attention computation efficiency. Then, we introduce a memory-efficient mask decoder (ME-Decoder) to eliminate the demand for the hierarchical structure by leveraging cross-scale segmentation decoding. Extensive experiments reveal that HER-Seg outperforms state-of-the-art methods in high-resolution medical 2D, 3D and video segmentation tasks. In particular, HER-Seg requires only 0.59GB training GPU memory and 9.39G inference FLOPs per 1024×1024 image, demonstrating superior memory and computation efficiency. The code is available at https://github.com/xq141839/HER-Seg.
Chinese: HER-Seg是一种高效的高分辨率医学图像分割框架,通过计算高效编码器和内存高效解码器显著降低资源消耗,在多种分割任务中超越现有最优方法。
English: HER-Seg is a highly efficient framework for high-resolution medical image segmentation that reduces computational and memory costs through a computation-efficient encoder and memory-efficient decoder, outperforming state-of-the-art methods in various tasks.

Authors:Yujia Hu, Songhua Liu, Xingyi Yang, Xinchao Wang
Title: Flash Sculptor: Modular 3D Worlds from Objects
Abstract:
Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome if not altogether infeasible. To overcome these challenges, we propose Flash Sculptor, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that combines efficiency and accuracy, while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3 times speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance. Code is available at https://github.com/YujiaHu1109/Flash-Sculptor.
中文: Flash Sculptor采用分治策略实现单图像组合式3D重建,通过粗到精旋转估计和离群点去除算法,在提速三倍的同时刷新了性能基准。
English: Flash Sculptor introduces a divide-and-conquer framework for efficient compositional 3D reconstruction from a single image, achieving at least 3x faster speeds and superior performance without iterative optimization.

Authors:Saad Wazir, Daeyoung Kim
Title: Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion
Abstract:
Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: https://github.com/saadwazir/ReN-UNet
中文摘要:该研究提出的嵌套UNet架构通过多尺度特征融合和注意力机制,有效解决了医学图像分割中特征提取的局限性,在多个数据集上超越了现有最优方法的性能表现。
English Summary: The proposed nested UNet architecture with multiscale feature fusion and attention mechanisms effectively addresses limitations in medical image segmentation by enhancing feature integration and spatial detail restoration, outperforming current state-of-the-art methods across multiple datasets.

Authors:Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
Title: V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.
中文: 我们提出V-MAGE这一基于游戏的评估框架,用于测试多模态大语言模型在交互环境中的动态视觉推理能力,结果显示尽管模型在简单任务中接近人类水平,但在需要复杂推理的任务中仍存在显著差距。
English: We introduce V-MAGE, a game-based framework that evaluates Multimodal Large Language Models' dynamic visual reasoning in interactive environments, revealing significant performance gaps in complex tasks compared to humans despite near-human ability in simple scenarios.
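
Note: The abstract mentions an Elo-based ranking system without giving its details. As a point of reference, the classic Elo update looks like the sketch below; the K-factor of 32 and the 400-point scale are the conventional defaults, assumed here rather than taken from V-MAGE.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (draw) or 0.0 (A loses)."""
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; the first one wins the comparison.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

The update is zero-sum: whatever rating one side gains, the other loses.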

Authors:Luigi Tresca, Carolin Schmidt, James Harrison, Filipe Rodrigues, Gioele Zardini, Daniele Gammelli, Marco Pavone
Title: Robo-taxi Fleet Coordination at Scale via Reinforcement Learning
Abstract:
Fleets of robo-taxis offering on-demand transportation services, commonly known as Autonomous Mobility-on-Demand (AMoD) systems, hold significant promise for societal benefits, such as reducing pollution, energy consumption, and urban congestion. However, orchestrating these systems at scale remains a critical challenge, with existing coordination algorithms often failing to exploit the systems' full potential. This work introduces a novel decision-making framework that unites mathematical modeling with data-driven techniques. In particular, we present the AMoD coordination problem through the lens of reinforcement learning and propose a graph network-based framework that exploits the main strengths of graph representation learning, reinforcement learning, and classical operations research tools. Extensive evaluations across diverse simulation fidelities and scenarios demonstrate the flexibility of our approach, achieving superior system performance, computational efficiency, and generalizability compared to prior methods. Finally, motivated by the need to democratize research efforts in this area, we release publicly available benchmarks, datasets, and simulators for network-level coordination alongside an open-source codebase designed to provide accessible simulation platforms and establish a standardized validation process for comparing methodologies. Code available at: https://github.com/StanfordASL/RL4AMOD
中文:本研究提出了一种新颖的强化学习框架,用于自动驾驶按需出行系统,结合图网络与运筹学方法提升协调效率和性能,并公开了基准测试和开源工具以支持研究普及。
English: This work introduces a novel reinforcement learning framework for Autonomous Mobility-on-Demand systems that combines graph networks with operations research to enhance coordination efficiency and performance, supported by publicly released benchmarks and open-source tools.

Authors:Davide Sferrazza, Gabriele Berton, Gabriele Trivigno, Carlo Masone
Title: To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) is a critical task in computer vision, traditionally enhanced by re-ranking retrieval results with image matching. However, recent advancements in VPR methods have significantly improved performance, challenging the necessity of re-ranking. In this work, we show that modern retrieval systems often reach a point where re-ranking can degrade results, as current VPR datasets are largely saturated. We propose using image matching as a verification step to assess retrieval confidence, demonstrating that inlier counts can reliably predict when re-ranking is beneficial. Our findings shift the paradigm of retrieval pipelines, offering insights for more robust and adaptive VPR systems. The code is available at https://github.com/FarInHeight/To-Match-or-Not-to-Match.
中文: 本研究证明,在现代检索系统中,视觉位置识别的重排序可能降低性能,并提出改用图像匹配作为验证步骤,通过内点数量判断何时重排序更有利。
English: This study demonstrates that re-ranking in Visual Place Recognition can degrade performance with modern retrieval systems, proposing instead to use image matching as a verification step to determine when re-ranking is beneficial based on inlier counts.
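
Note: The verification idea in the abstract (use inlier counts to decide whether re-ranking is worthwhile) can be sketched as a simple gate. The function name, the `count_inliers` callable and the threshold value are hypothetical; the paper's actual decision rule may differ.

```python
def verify_then_rerank(candidates, count_inliers, threshold=50):
    """Re-rank only when the top retrieval result looks unreliable.

    candidates:    database image ids in retrieval order (best first)
    count_inliers: callable returning the number of matching inliers
                   between the query and a candidate (assumed to exist)
    Returns (ranking, was_reranked).
    """
    if not candidates:
        return [], False
    if count_inliers(candidates[0]) >= threshold:
        # Enough inliers: trust the retrieval order as-is.
        return list(candidates), False
    # Low confidence: order candidates by inlier count, highest first.
    return sorted(candidates, key=count_inliers, reverse=True), True

# Hypothetical inlier counts for three retrieved images.
counts = {"a": 10, "b": 80, "c": 30}
ranking, reranked = verify_then_rerank(["a", "b", "c"], counts.get)
```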

Authors:Vincenzo Petrone, Enrico Ferrentino, Pasquale Chiacchio
Title: A ROS2-based software library for inverse dynamics computation
Abstract:
Inverse dynamics computation is a critical component in robot control, planning and simulation, enabling the calculation of joint torques required to achieve a desired motion. This paper presents a ROS2-based software library designed to solve the inverse dynamics problem for robotic systems. The library is built around an abstract class with three concrete implementations: one for simulated robots and two for real UR10 and Franka robots. This contribution aims to provide a flexible, extensible, robot-agnostic solution to inverse dynamics, suitable for both simulation and real-world scenarios involving planning and control applications. The related software is available at https://github.com/unisa-acg/inverse-dynamics-solver/tree/rap.
中文: 本文提出了一种基于ROS2的软件库,为仿真和真实机器人系统提供灵活可扩展的逆动力学解决方案,并实现了针对UR10和Franka机器人的具体应用。
English: This paper introduces a ROS2-based software library that provides a flexible, extensible solution for computing inverse dynamics in both simulated and real robotic systems, with implementations for UR10 and Franka robots.

Authors:Hao Li, Zhenyu Liang, Ran Cheng
Title: GPU-accelerated Evolutionary Many-objective Optimization Using Tensorized NSGA-III
Abstract:
NSGA-III is one of the most widely adopted algorithms for tackling many-objective optimization problems. However, its CPU-based design severely limits scalability and computational efficiency. To address the limitations, we propose TensorNSGA-III, a fully tensorized implementation of NSGA-III that leverages GPU parallelism for large-scale many-objective optimization. Unlike conventional GPU-accelerated evolutionary algorithms that rely on heuristic approximations to improve efficiency, TensorNSGA-III maintains the exact selection and variation mechanisms of NSGA-III while achieving significant acceleration. By reformulating the selection process with tensorized data structures and an optimized caching strategy, our approach effectively eliminates computational bottlenecks inherent in traditional CPU-based and naïve GPU implementations. Experimental results on widely used numerical benchmarks show that TensorNSGA-III achieves speedups of up to 3629× over the CPU version of NSGA-III. Additionally, we validate its effectiveness in multiobjective robotic control tasks, where it discovers diverse and high-quality behavioral solutions. Furthermore, we investigate the critical role of large population sizes in many-objective optimization and demonstrate the scalability of TensorNSGA-III in such scenarios. The source code is available at https://github.com/EMI-Group/evomo.
中文: TensorNSGA-III是NSGA-III的完全张量化GPU实现,在保持原有精确机制的同时实现了高达3629倍的加速,并在大规模多目标优化中展现出卓越的可扩展性。
English: TensorNSGA-III is a fully tensorized GPU implementation of NSGA-III that maintains its exact mechanisms while achieving up to 3629× speedup and demonstrating superior scalability for large-scale many-objective optimization.

Authors:Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis
Title: Multi-Sense Embeddings for Language Models and Knowledge Distillation
Abstract:
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
中文摘要:本文提出多义嵌入替代标准词嵌入以更好捕捉词汇语义,并通过基于义项词典的知识蒸馏方法训练出更小更高效的学生模型,在保持性能的同时显著节省空间和推理时间。
English Summary: This paper introduces multi-sense embeddings as a replacement for standard token embeddings to better capture word meanings, and proposes a knowledge distillation method using a sense dictionary to create smaller, efficient student models while maintaining performance.
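
Note: The abstract builds a sense dictionary by clustering contextual embeddings and keeping the cluster centers. The sketch below uses plain k-means as the clustering algorithm, which is an assumption (the abstract does not name one); the toy 2D vectors stand in for real LLM embeddings.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=20):
    """Plain k-means; returns k cluster centers (the 'sense embeddings')."""
    centers = [list(p) for p in points[:k]]  # deterministic init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def nearest_sense(vec, centers):
    """Index of the sense embedding closest to a contextual embedding."""
    return min(range(len(centers)), key=lambda c: squared_distance(vec, centers[c]))

# Toy contextual embeddings of one token, forming two well-separated senses.
embeddings = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
senses = kmeans(embeddings, k=2)
sense_id = nearest_sense((9.0, 9.0), senses)
```

At inference time, a lookup such as `nearest_sense` would map a contextual embedding to one entry of the dictionary.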

Authors:Dahyun Kang, Ahmet Iscen, Eunchan Jo, Sua Choi, Minsu Cho, Cordelia Schmid
Title: Memory-Modular Classification: Learning to Generalize with Memory Replacement
Abstract:
We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in the external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator is that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.
中文: 本文提出了一种用于图像分类的记忆模块化学习器,它将知识记忆与推理分离,通过更新外部记忆无需重新训练即可泛化到新类别,并利用网络数据进行元学习,在多种分类任务中展现出稳健性能。
English: This paper introduces a memory-modular learner for image classification that decouples knowledge memorization from reasoning, enabling generalization to new classes by updating external memory without retraining and achieving robust performance across diverse tasks through meta-learning with web data.

Authors:Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Title: Latent Multimodal Reconstruction for Misinformation Detection
Abstract:
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have focused on developing datasets and methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic examples that lack real-world complexity, limiting model robustness. Meanwhile, Large Vision-Language Models (LVLMs) remain underexplored for generating diverse and realistic synthetic data for MMD. To address this, we introduce "MisCaption This!", a collection of LVLM-generated miscaptioned image datasets. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art results on both the NewsCLIPpings and VERITE benchmarks, highlighting the value of LVLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
中文: 研究人员推出了由大型视觉语言模型生成的错误标注图像数据集“MisCaption This!”以及基于重构的网络“LAMAR”,二者通过提升模型泛化能力显著改进了多模态虚假信息检测,并在权威基准测试中取得了最优性能。
English: Researchers introduce "MisCaption This!", a dataset of miscaptioned images generated by Large Vision-Language Models, and "LAMAR", a reconstruction-based network that together enhance multimodal misinformation detection by improving model generalization and achieving state-of-the-art results on benchmarks.

Authors:Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Title: Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?
Abstract:
Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing. All generated configurations are publicly available in the LEMUR Neural Network Dataset (https://github.com/ABrain-One/nn-dataset), which serves as an open source benchmark for hyperparameter optimization research and provides a practical resource to improve training efficiency in image manipulation systems.
中文摘要:本研究通过LoRA微调Code Llama开发出高效的基于大语言模型的超参数优化器,在保持竞争力的均方根误差同时显著降低计算开销,为图像分类、检测和分割等任务提供了优于传统方法的调优方案。
English Summary: This study fine-tunes Code Llama with LoRA to create an efficient LLM-based hyperparameter optimizer that achieves competitive or superior RMSE compared with traditional methods such as Optuna and TPE on computer vision tasks while substantially reducing computational overhead.

Authors:Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
Title: An Empirical Study of GPT-4o Image Generation Capabilities
Abstract:
The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Although recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling. For a high-definition version of the PDF, please refer to the link on GitHub: https://github.com/Ephemeral182/Empirical-Study-of-GPT-4o-Image-Gen.
中文: 本研究对GPT-4o的图像生成能力进行了实证评估,通过与其他模型对比分析,揭示了其优势与局限,并为未来统一生成模型的架构设计指明了发展方向。
English: This study empirically evaluates GPT-4o's image generation capabilities across multiple tasks, benchmarking it against other models to identify strengths, limitations, and future directions for unified generative architectures.

Authors:Kuntian Zhang, Simin Yu, Yaoshu Wang, Makoto Onizuka, Chuan Xiao
Title: CKGAN: Training Generative Adversarial Networks Using Characteristic Kernel Integral Probability Metrics
Abstract:
In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metric framework with a characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that, at the cost of moderately more training time, the automatically selected kernel function delivers performance very close to that of the best manually fine-tuned one on real image benchmarks and is able to improve the performance of other MMD-based GANs.
中文: 本文提出CKGAN,一种基于特征核积分概率度量的生成对抗网络变体,通过将生成图像映射回随机噪声缓解模式崩溃问题,并采用软选择方法自动学习最优核函数,在多个基准数据集上优于其他基于MMD的GAN模型。
English: This paper introduces CKGAN, a GAN variant using characteristic kernel-based integral probability metrics to mitigate mode collapse and automatically learn optimal kernels, demonstrating superior performance over other MMD-based GANs on benchmark datasets.
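
The abstract above builds on the maximum mean discrepancy (MMD). As a minimal sketch of that underlying quantity only (not the CKIPM objective or the CKGAN training loop, neither of which is specified here), the following computes the biased sample estimate of squared MMD with a Gaussian kernel. The function names, the fixed bandwidth, and the toy samples are illustrative assumptions.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel, a standard example of a characteristic kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D samples (hypothetical data, for illustration only).
real = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
close = [(0.05, 0.0), (0.0, -0.1), (-0.1, 0.1)]
far = [(3.0, 3.0), (3.1, 2.9), (2.8, 3.1)]

print(mmd_squared(real, close), mmd_squared(real, far))
```

A sample drawn near the reference points yields a smaller value than one drawn far away, which is the property an MMD-based GAN loss relies on.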

Authors:Peerat Limkonchotiwat, Kanruethai Masuk, Surapon Nonesung, Chalermpun Mai-On, Sarana Nutanong, Wuttikorn Ponwitayarat, Potsawee Manakul
Title: Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation
Abstract:
Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs' ability on local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.
中文:大型语言模型在泰语方言中的表现远不及标准泰语,仅有GPT-4o和Gemini2等专有模型展现出一定流畅性,这凸显了对小众语言进行更全面评估和改进的必要性。
English: Large language models exhibit significant performance drops in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 showing some fluency, highlighting the need for better evaluation and improvement in underrepresented languages.

Authors:Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li
Title: HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Abstract:
The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large MoE model size still introduces substantial memory demands, which usually require expert offloading on resource-constrained platforms and incur significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead but faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, the hybrid CPU-GPU schedule for MoE is inherently complex due to the diverse expert sizes, structures, uneven workload distribution, etc. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to the state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
中文:混合专家(MoE)架构在不显著增加计算成本的情况下提升了模型能力,但面临内存限制;提出的HybriMoE框架通过动态CPU-GPU调度和缓存管理,有效提升了推理效率。
English: The Mixture of Experts (MoE) architecture increases model capacity without proportional computational cost but faces memory constraints, which the proposed HybriMoE framework addresses through dynamic CPU-GPU scheduling and cache management to enhance inference efficiency.

Authors:Toby van Gastelen, Wouter Edeling, Benjamin Sanderse
Title: Energy-Conserving Neural Network Closure Model for Long-Time Accurate and Stable LES
Abstract:
Machine learning-based closure models for LES have shown promise in capturing complex turbulence dynamics but often suffer from instabilities and physical inconsistencies. In this work, we develop a novel skew-symmetric neural architecture as a closure model that enforces stability while preserving key physical conservation laws. Our approach leverages a discretization that ensures mass, momentum, and energy conservation, along with a face-averaging filter to maintain mass conservation in coarse-grained velocity fields. We compare our model against several conventional data-driven closures (including unconstrained convolutional neural networks) and the physics-based Smagorinsky model. Performance is evaluated on decaying turbulence and Kolmogorov flow for multiple coarse-graining factors. In these test cases, we observe that unconstrained machine learning models suffer from numerical instabilities. In contrast, our skew-symmetric model remains stable across all tests, though at the cost of increased dissipation. Despite this trade-off, we demonstrate that our model still outperforms the Smagorinsky model in unseen scenarios. These findings highlight the potential of structure-preserving machine learning closures for reliable long-time LES.
中文: 本研究提出的斜对称神经网络闭合模型在确保数值稳定性和物理守恒律的同时,尽管耗散略有增加,但在未知场景中仍优于传统模型。
English: The proposed skew-symmetric neural network closure model ensures numerical stability and preserves physical conservation laws, outperforming conventional models in unseen scenarios despite slightly increased dissipation.
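
The stability argument behind a skew-symmetric closure rests on a general linear-algebra fact: for dynamics du/dt = K u with K skew-symmetric, the energy 0.5*|u|^2 changes at rate u^T K u, which is zero. The sketch below checks only that fact on a small hand-picked matrix. It is not the paper's neural architecture; the matrix `raw`, the state `u`, and the function names are illustrative assumptions.

```python
def skew_symmetric(b):
    """Build K = B - B^T, which satisfies K^T = -K for any square B."""
    n = len(b)
    return [[b[i][j] - b[j][i] for j in range(n)] for i in range(n)]


def energy_rate(k, u):
    """Return u^T K u, the rate of change of 0.5*|u|^2 under du/dt = K u."""
    n = len(u)
    return sum(u[i] * k[i][j] * u[j] for i in range(n) for j in range(n))


# Arbitrary (hypothetical) operator and state, chosen only for illustration.
raw = [[0.0, 1.0, 2.0],
       [3.0, 0.0, -1.0],
       [0.5, 2.0, 0.0]]
u = [1.0, -2.0, 0.5]

k = skew_symmetric(raw)
print(energy_rate(raw, u))  # generally non-zero for an unconstrained operator
print(energy_rate(k, u))    # vanishes for the skew-symmetric operator
```

An unconstrained operator can inject energy and blow up over long rollouts, whereas the skew-symmetric part cannot, which is consistent with the stability behaviour the abstract reports.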

Authors:Junxi Chen, Junhao Dong, Xiaohua Xie
Title: Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
Abstract:
Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack a massive number of benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.
中文: 研究发现,将图像提示适配器(IP-Adapter)集成到文本到图像扩散模型中会引发劫持攻击,攻击者可通过难以察觉的对抗样本破坏图像生成服务并损害公众信任,同时探讨了结合对抗训练模型等防御措施来应对此威胁。
English: The study reveals that integrating the Image Prompt Adapter (IP-Adapter) into text-to-image diffusion models enables a hijacking attack, where adversaries can use imperceptible adversarial examples to compromise image generation services and undermine public trust, with defenses explored to mitigate this threat.

Authors:Shiao Wang, Xiao Wang, Bo Jiang, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian, Jin Tang
Title: Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset
Abstract:
Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. To address these challenges, biologically inspired event cameras offer a promising solution to overcome the limitations of traditional RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras. The first contribution is the proposed large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gaps. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract the feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features, the key module of which is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient via FVEs into this module. After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method consistently performs well, validating its effectiveness and robustness. The source code and benchmark dataset will be released at https://github.com/Event-AHU/HARDVS/tree/HARDVSv2.
中文: 本研究提出了一种新颖的多模态框架MMHCO-HAR,通过结合RGB和事件相机来提升人类活动识别能力,并基于HARDVS 2.0数据集和全面实验验证了其有效性。
English: This study introduces a novel multi-modal framework, MMHCO-HAR, combining RGB and event cameras to enhance human activity recognition, supported by the HARDVS 2.0 dataset and validated through comprehensive experiments.

Authors:Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Title: Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
Abstract:
Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervision, such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By continuously minimizing the predictive entropy of LLMs on unlabeled questions in a latent semantic space, EMPO achieves competitive performance compared to supervised counterparts on both mathematical and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1% to 50.1% on MMLU-Pro. Primary experiments and analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
中文: 本文提出熵最小化策略优化(EMPO),这是一种完全无监督的方法,通过在潜在语义空间持续最小化大语言模型对未标注问题的预测熵,无需外部监督即可在数学和自由形式自然推理任务上取得与监督方法相媲美的性能。
English: This paper introduces Entropy Minimized Policy Optimization (EMPO), a fully unsupervised method that enhances large language models' reasoning by minimizing their predictive entropy on unlabeled questions, achieving competitive results on mathematical and natural reasoning tasks without external supervision.
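
The quantity being minimized above is a predictive entropy measured over meanings rather than raw strings. A rough sketch of one way to compute such an entropy follows: sample several answers to the same question, group them into equivalence classes, and take the Shannon entropy of the class frequencies. The grouping rule here (whitespace- and case-insensitive string equality) is a deliberately crude, hypothetical stand-in for a real semantic clustering step, and none of this is claimed to match the EMPO implementation.

```python
import math
from collections import Counter


def semantic_entropy(answers, canonicalize=lambda s: s.strip().lower()):
    """Shannon entropy (in nats) over clusters of equivalent sampled answers.

    `canonicalize` maps an answer to a cluster key; the default is an
    assumption made for illustration, not a real semantic-equivalence check.
    """
    clusters = Counter(canonicalize(a) for a in answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())


# Hypothetical sampled answers to one unlabeled question.
confident = ["42", " 42", "42 ", "42"]
uncertain = ["42", "41", "forty", "7"]

print(semantic_entropy(confident), semantic_entropy(uncertain))
```

Low entropy means the samples agree on one meaning. An entropy-minimizing objective would favour outputs that land in the dominant cluster, using no labels at all.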

Authors:Seongmin Park, Mincheol Yoon, Hye-young Kim, Jongwuk Lee
Title: Why is Normalization Necessary for Linear Recommenders?
Abstract:
Despite their simplicity, linear autoencoder (LAE)-based models have shown comparable or even better performance with faster inference speed than neural recommender models. However, LAEs face two critical challenges: (i) popularity bias, the tendency to recommend popular items, and (ii) neighborhood bias, an excessive focus on capturing local item correlations. To address these issues, this paper first analyzes the effect of two existing normalization methods for LAEs, i.e., random-walk and symmetric normalization. Our theoretical analysis reveals that normalization highly affects the degree of popularity and neighborhood biases among items. Inspired by this analysis, we propose a versatile normalization solution, called Data-Adaptive Normalization (DAN), which flexibly controls the popularity and neighborhood biases by adjusting item- and user-side normalization to align with unique dataset characteristics. Owing to its model-agnostic property, DAN can be easily applied to various LAE-based models. Experimental results show that DAN-equipped LAEs consistently improve existing LAE-based models across six benchmark datasets, with significant gains of up to 128.57% and 12.36% for long-tail items and unbiased evaluations, respectively. Our code is available at https://github.com/psm1206/DAN.
中文: 本文提出数据自适应归一化方法(DAN),通过动态调整归一化策略以适应数据集特性,有效缓解线性自编码推荐模型中的流行度偏差和邻域偏差,在长尾项目推荐和无偏评估中实现了显著性能提升。
English: This paper introduces Data-Adaptive Normalization (DAN), a model-agnostic method that effectively mitigates popularity and neighborhood biases in linear autoencoder-based recommender systems by dynamically adjusting normalization to fit dataset characteristics, achieving significant performance improvements in long-tail item recommendations and unbiased evaluations.
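
For readers unfamiliar with the two normalizations named in the abstract, the sketch below applies their standard graph-style definitions to a small item-item co-occurrence matrix: random-walk normalization D^{-1} A and symmetric normalization D^{-1/2} A D^{-1/2}, where D holds the row sums. These textbook forms are an assumption about what is meant. The exact matrices normalized in the paper, and DAN itself, may differ, and the toy matrix is invented for illustration.

```python
import math


def row_sums(matrix):
    """Degree of each item: the sum of its row."""
    return [sum(row) for row in matrix]


def random_walk_normalize(matrix):
    """D^{-1} A: divide each row by its degree so every row sums to 1."""
    deg = row_sums(matrix)
    return [[v / deg[i] for v in row] for i, row in enumerate(matrix)]


def symmetric_normalize(matrix):
    """D^{-1/2} A D^{-1/2}: scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    deg = row_sums(matrix)
    return [[v / math.sqrt(deg[i] * deg[j]) for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


# Hypothetical symmetric co-occurrence counts; item 0 is the most popular.
item_cooccurrence = [[4.0, 2.0, 1.0],
                     [2.0, 3.0, 0.0],
                     [1.0, 0.0, 1.0]]

rw = random_walk_normalize(item_cooccurrence)
sym = symmetric_normalize(item_cooccurrence)
print(rw)
print(sym)
```

Both variants shrink the entries of high-degree (popular) items. Random-walk normalization does so only on the row side, while the symmetric form splits the correction between rows and columns and keeps the matrix symmetric.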

Authors:Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
Title: StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Abstract:
The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial attack method that manipulates LLM-driven ranking systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs), adversarial text sequences embedded within item or document descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target items while avoiding explicit manipulation traces. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven ranking systems. Our code is publicly available at https://github.com/Tangyiming205069/controllable-seo.
中文: 本文提出StealthRank攻击方法,通过基于能量的优化框架和朗之万动力学生成隐蔽提示,能在保持文本流畅性的同时有效操纵大语言模型驱动的排序系统,暗中提升目标条目排名且避免被检测到。
English: This paper introduces StealthRank, an adversarial attack method that manipulates LLM-driven ranking systems through energy-based optimization and Langevin dynamics to generate stealthy prompts, effectively boosting target items' rankings while maintaining textual fluency and evading detection.

Authors:Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang
Title: MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Abstract:
Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal the significant limitations of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of next-generation models. Our data and code are available at https://github.com/LanceZPF/MDK12.
中文: MDK12-Bench是一个基于K-12真实考题构建的多学科基准测试,通过大规模数据标注和动态评估框架揭示了当前多模态大语言模型在推理能力上的显著不足。
English: MDK12-Bench is a comprehensive multi-disciplinary benchmark using K-12 exam questions to evaluate multimodal reasoning in MLLMs, revealing significant limitations in current models through its large-scale dataset and dynamic evaluation framework.

Authors:Luigi Rovito, Marco Virgolin
Title: Interpretable Non-linear Survival Analysis with Evolutionary Symbolic Regression
Abstract:
Survival Regression (SuR) is a key technique for modeling time to event in important applications such as clinical trials and semiconductor manufacturing. Currently, SuR algorithms belong to one of three classes: non-linear black-box -- allowing adaptability to many datasets but offering limited interpretability (e.g., tree ensembles); linear glass-box -- being easier to interpret but limited to modeling only linear interactions (e.g., Cox proportional hazards); and non-linear glass-box -- allowing adaptability and interpretability, but empirically found to have several limitations (e.g., explainable boosting machines, survival trees). In this work, we investigate whether Symbolic Regression (SR), i.e., the automated search of mathematical expressions from data, can lead to non-linear glass-box survival models that are interpretable and accurate. We propose an evolutionary, multi-objective, and multi-expression implementation of SR adapted to SuR. Our empirical results on five real-world datasets show that SR consistently outperforms traditional glass-box methods for SuR in terms of accuracy per number of dimensions in the model, while exhibiting comparable accuracy with black-box methods. Furthermore, we offer qualitative examples to assess the interpretability potential of SR models for SuR. Code at: https://github.com/lurovi/SurvivalMultiTree-pyNSGP.
中文:符号回归被提出作为生存回归的新型非线性透明盒方法,在保持与黑盒模型相媲美的准确性的同时,其精度优于传统可解释方法,并展现出更强的可解释性潜力。
English: Symbolic Regression is proposed as a novel non-linear glass-box method for Survival Regression, demonstrating superior accuracy over traditional interpretable models while maintaining competitive performance with black-box approaches and offering enhanced interpretability.

Authors:Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, Maosong Sun
Title: LLM×MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
Abstract:
Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generation has received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLM×MapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLM×MapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines. Both LLM×MapReduce-V2 and SurveyEval are publicly available at https://github.com/thunlp/LLMxMapReduce.
中文摘要:本文提出LLM×MapReduce-V2这一新型测试时扩展策略,通过堆叠卷积层逐步扩展对输入材料的理解,显著增强大语言模型处理超长输入并生成连贯长文本的能力。
English Summary: This paper introduces LLM×MapReduce-V2, a novel test-time scaling strategy that enhances large language models' capacity to process extremely long inputs and generate coherent long-form articles by progressively expanding understanding through stacked convolutional layers.

Authors:Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen
Title: POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction
Abstract:
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules, where the latter is pivotal for distinguishing dynamic regions, which helps mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. The recently proposed pointmap representation in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
中文: POMATO 是一个将点云匹配与时间运动相结合的统一框架,通过确保尺度一致性和精确几何关系,显著提升了动态三维重建在深度估计和三维点跟踪等任务中的性能。
English: POMATO is a unified framework that integrates pointmap matching with temporal motion to enhance dynamic 3D reconstruction, improving performance in tasks like depth estimation and 3D point tracking by ensuring scale consistency and precise geometry.

Authors:Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov
Title: kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization
Abstract:
Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc
中文摘要:本文通过加性合成和新距离度量方法,提升了kNN-VC框架在零样本歌声转换中的谐波质量和流畅度。
English Summary: This paper enhances zero-shot singing voice conversion by improving harmonic quality and smoothness in the kNN-VC framework through additive synthesis and a novel distance metric.

Authors:Shunsuke Sakai, Shunsuke Tsuge, Tatsuhito Hasegawa
Title: Noisy Deep Ensemble: Accelerating Deep Ensemble Learning via Noise Injection
Abstract:
Neural network ensembling is a simple yet effective approach for enhancing generalization capabilities. The most common method involves independently training multiple neural networks initialized with different weights and then averaging their predictions during inference. However, this approach increases training time linearly with the number of ensemble members. To address this issue, we propose the novel "Noisy Deep Ensemble" method, which significantly reduces the training time required for neural network ensembles. In this method, a parent model is trained until convergence, and then the weights of the parent model are perturbed in various ways to construct multiple child models. This perturbation of the parent model weights facilitates the exploration of different local minima while significantly reducing the training time for each ensemble member. We evaluated our method using diverse CNN architectures on the CIFAR-10 and CIFAR-100 datasets, surpassing conventional efficient ensemble methods and achieving test accuracy comparable to standard ensembles. Code is available at https://github.com/TSTB-dev/NoisyDeepEnsemble.
中文: 提出的“噪声深度集成”方法通过扰动已收敛父模型的权重来创建子模型,显著减少了神经网络集成的训练时间,在保持与标准集成相当精度的同时提高了效率。
English: The proposed "Noisy Deep Ensemble" method reduces training time for neural network ensembles by perturbing a converged parent model's weights to create child models, achieving comparable accuracy to standard ensembles with improved efficiency.
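
The parent-to-children construction described above can be sketched on a toy linear model. The abstract says only that the parent weights are perturbed "in various ways", so additive Gaussian noise is an assumption here, as are the noise scale, the number of children, and all function names. Any further training of the child models is omitted.

```python
import random


def perturb(weights, noise_std, rng):
    """Create one child by adding Gaussian noise to each parent weight."""
    return [w + rng.gauss(0.0, noise_std) for w in weights]


def predict(weights, x):
    """Toy linear model standing in for a trained network."""
    return sum(w * xi for w, xi in zip(weights, x))


def ensemble_predict(members, x):
    """Average the predictions of all ensemble members."""
    return sum(predict(w, x) for w in members) / len(members)


rng = random.Random(0)       # fixed seed so the sketch is reproducible
parent = [0.5, -1.0, 2.0]    # pretend these weights come from a converged parent
children = [perturb(parent, noise_std=0.05, rng=rng) for _ in range(4)]

x = [1.0, 2.0, 3.0]
print(predict(parent, x), ensemble_predict(children, x))
```

Each child starts from the parent's solution rather than from scratch, which is where the training-time saving in the abstract comes from. The averaging step is the same as in a standard deep ensemble.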

Authors:Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa
Title: Reconstruction-Free Anomaly Detection with Diffusion Models
Abstract:
Despite their remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose a novel inversion-based AD approach, detection via noising in latent space, which circumvents explicit reconstruction. Importantly, we contend that the limitations of prior reconstruction-based methods originate from the prevailing paradigm of detection via denoising in RGB space. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we use only a few inversion steps to noise the clean image for inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features of the clean test image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analysis across three widely used image AD datasets under the unsupervised unified setting to demonstrate the effectiveness of our model, which achieves state-of-the-art AD performance and about a 2× inference speedup without diffusion distillation.
中文: 本文提出了一种新颖的基于反转的异常检测方法,通过在潜在空间中对图像加噪来避免显式重建,在保持高精度的同时实现了约两倍的推理加速,并达到了最先进的检测性能。
English: This paper introduces a novel inversion-based anomaly detection method that avoids explicit reconstruction by noising images in latent space, achieving state-of-the-art performance with approximately twice the inference speed while maintaining high accuracy.

Authors:Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
Title: Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
Abstract:
Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
中文摘要:提出的嵌套Res2Net(Nes2Net)架构无需降维即可直接处理高维语音特征,在多个语音伪造检测数据集中实现了显著性能提升和计算成本节约。
English Summary: The proposed Nested Res2Net (Nes2Net) architecture directly processes high-dimensional speech features without dimensionality reduction, achieving significant performance improvements and computational savings across multiple speech spoofing detection datasets.

Authors:Yan Zhang, Zhong Ji, Changxu Meng, Yanwei Pang, Jungong Han
Title: iEBAKER: Improved Remote Sensing Image-Text Retrieval Framework via Eliminate Before Align and Keyword Explicit Reasoning
Abstract:
Recent studies focus on Remote Sensing Image-Text Retrieval (RSITR), which aims to search for the corresponding targets based on a given query. Among these efforts, the application of Foundation Models (FMs), such as CLIP, to the domain of remote sensing has yielded encouraging outcomes. However, existing FM-based methodologies neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose an approach named iEBAKER (an Improved Eliminate Before Align strategy with Keyword Explicit Reasoning framework) for RSITR. Specifically, we propose an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs, thereby mitigating their deviations from the optimal embedding space during alignment. Further, two specific schemes are introduced from the perspective of whether local similarity and global similarity affect each other. On this basis, we introduce an alternative Sort After Reversed Retrieval (SAR) strategy, which aims to optimize the similarity matrix via reverse retrieval. Additionally, we incorporate a Keyword Explicit Reasoning (KER) module to facilitate the beneficial impact of subtle key concept distinctions. Without bells and whistles, our approach enables a direct transition from FM to RSITR task, eliminating the need for additional pretraining on remote sensing data. Extensive experiments conducted on three popular benchmark datasets demonstrate that our proposed iEBAKER method surpasses the state-of-the-art models while requiring less training data. Our source code will be released at https://github.com/zhangy0822/iEBAKER.
中文:提出的iEBAKER框架通过"先剔除后对齐"策略过滤弱相关样本对,并结合关键词显式推理模块增强文本区分度,在无需额外预训练的情况下实现了遥感图文检索的最优性能。
English: The proposed iEBAKER framework addresses limitations in Remote Sensing Image-Text Retrieval by introducing an Eliminate Before Align strategy to filter weakly correlated samples and a Keyword Explicit Reasoning module to enhance text distinction, achieving state-of-the-art performance without additional pretraining.

Authors:Igor Polyakov, Alexey Dukhanov, Egor Spirin
Title: TAGC: Optimizing Gradient Communication in Distributed Transformer Training
Abstract:
The increasing complexity of large language models (LLMs) necessitates efficient training strategies to mitigate the high computational costs associated with distributed training. A significant bottleneck in this process is gradient synchronization across multiple GPUs, particularly in the zero-redundancy parallelism mode. In this paper, we introduce Transformer-Aware Gradient Compression (TAGC), an optimized gradient compression algorithm designed specifically for transformer-based models. TAGC extends the lossless homomorphic compression method by adapting it for sharded models and incorporating transformer-specific optimizations, such as layer-selective compression and dynamic sparsification. Our experimental results demonstrate that TAGC accelerates training by up to 15% compared to the standard Fully Sharded Data Parallel (FSDP) approach, with minimal impact on model quality. We integrate TAGC into the PyTorch FSDP framework; the implementation is publicly available at https://github.com/ipolyakov/TAGC.
中文: 本文提出TAGC算法,这是一种针对Transformer模型的梯度压缩技术,可将大型语言模型的分布式训练速度提升高达15%,且对模型质量影响极小,现已集成至PyTorch FSDP框架。
English: This paper introduces TAGC, a transformer-aware gradient compression algorithm that accelerates distributed training of large language models by up to 15% with minimal quality loss, now integrated into PyTorch FSDP.

Authors:Xiao Zhang, Xiangyu Han, Xiwen Lai, Yao Sun, Pei Zhang, Konrad Kording
Title: Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
Abstract:
Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both mask generation speed and segmentation accuracy. We present a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut without relying on recursive eigenvector computations, achieving substantially improved speed and accuracy. Falcon operates in two stages: (1) a fast K-way Normalized Cut solved by extending into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; and (2) refinement of the resulting masks using complementary low-level information, producing high-quality pixel-level segmentations. Experiments show that Falcon not only surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (reaching up to 4.3% improvement on Cityscapes), but also reduces runtime by around 30% compared to prior graph-based approaches. These findings demonstrate that the semantic information within foundation-model attention can be effectively harnessed by a highly parallelizable graph cut framework. Consequently, Falcon can narrow the gap between unsupervised and supervised segmentation, enhancing scalability in real-world applications and paving the way for dense prediction-based vision pre-training in various downstream tasks. The code is released at https://github.com/KordingLab/Falcon.
中文摘要:Falcon提出了一种基于正则化分数交替割的快速无监督图像分割方法,无需递归计算即可实现更优的速度和精度,显著缩小了与有监督方法之间的差距。
English Summary: Falcon introduces a faster and more accurate unsupervised image segmentation method using a regularized fractional alternating cut that avoids recursive computations, achieving significant improvements in both speed and accuracy over existing approaches.
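
As a rough illustration of the objective named in the Falcon abstract, the sketch below evaluates the textbook K-way Normalized Cut value of a given partition: for each part, the weight of edges leaving it divided by the total degree of its nodes, summed over parts. This is only the standard objective, not Falcon's fractional alternating solver; the function name `kway_normalized_cut` and the toy graph are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def kway_normalized_cut(weights: Dict[Edge, float], parts: List[Set[int]]) -> float:
    """Sum over parts of cut(A_k, rest) / vol(A_k) for an undirected weighted graph.

    `weights` lists each undirected edge once. A part whose nodes have no
    edges has zero volume and raises ZeroDivisionError.
    """
    total = 0.0
    for part in parts:
        cut = 0.0  # weight of edges with exactly one endpoint in the part
        vol = 0.0  # sum of weighted degrees of the nodes in the part
        for (u, v), w in weights.items():
            u_in, v_in = u in part, v in part
            if u_in != v_in:
                cut += w
            # every endpoint inside the part adds w to the part's volume
            vol += w * (u_in + v_in)
        total += cut / vol
    return total


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
toy_graph = {
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
    (2, 3): 1.0,
}
good_split = kway_normalized_cut(toy_graph, [{0, 1, 2}, {3, 4, 5}])
bad_split = kway_normalized_cut(toy_graph, [{0, 1}, {2, 3, 4, 5}])
```

Cutting only the bridge edge gives a lower value than splitting one of the triangles, which is the behavior a Normalized Cut solver tries to find automatically.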

Authors:Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram
Title: DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Abstract:
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate if the tokens are drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times$$\sim$$2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
中文: 推测解码通过使用草稿模型高效生成多个候选标记,再由目标模型并行验证,从而在不降低生成质量的前提下加速大语言模型的推理过程。
English: Speculative Decoding accelerates large language model inference by using a draft model to propose tokens and the target model to verify them in parallel, maintaining quality while increasing speed.
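
The draft-then-verify loop described in the abstract can be sketched with toy greedy models. This is a minimal illustration of generic speculative decoding with a fixed speculation length, not DEL's dynamic exit-layer selection; `target_model`, `draft_model`, and the token arithmetic are hypothetical stand-ins, and the verification loop below stands in for what a real system would do in one parallel forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next (greedy) token


def target_model(prefix: List[int]) -> int:
    # Hypothetical toy target: a deterministic function of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical toy draft: agrees with the target unless len(prefix) % 4 == 0.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def autoregressive_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, spec_len: int) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # Draft phase: the cheap model proposes spec_len tokens one by one.
        proposal: List[int] = []
        for _ in range(spec_len):
            proposal.append(draft(tokens + proposal))
        # Verify phase: keep the longest prefix the target agrees with,
        # then substitute the target's own token at the first mismatch.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens[:goal], rounds


prompt = [1, 2, 3]
spec_tokens, spec_rounds = speculative_decode(target_model, draft_model, prompt, 12, 4)
plain_tokens = autoregressive_decode(target_model, prompt, 12)
```

Because every kept token matches the target's greedy choice, the output is identical to plain autoregressive decoding; the gain comes from needing fewer verification rounds than generated tokens, which is why the choice of speculation length matters.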

Authors:Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang
Title: Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
Abstract:
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
中文: UnifyEdit是一种免调优方法,通过自注意力和交叉注意力约束优化扩散潜变量,并采用自适应调度器动态平衡二者,在文本图像编辑中实现了保真度与可编辑性的卓越统一。
English: UnifyEdit is a tuning-free method that optimizes diffusion latents using self-attention and cross-attention constraints, dynamically balanced by an adaptive scheduler to achieve superior fidelity and editability in text-based image editing.

Authors:Long Ma, Yuxin Feng, Yan Zhang, Jinyuan Liu, Weimin Wang, Guang-Yong Chen, Chengpei Xu, Zhuo Su
Title: CoA: Towards Real Image Dehazing via Compression-and-Adaptation
Abstract:
Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing remains an open problem due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to handle unknown real environments by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach's superiority and effectiveness. The code is publicly available at https://github.com/fyxnl/COA.
中文: 本文提出压缩与适应(CoA)计算流程,通过在合成域进行模型压缩并在真实域实施双层适应,有效解决真实图像去雾的效率与适应性难题。
English: This paper introduces a Compression-and-Adaptation (CoA) computational flow that combines model compression in synthetic domains with bilevel adaptation in real environments to efficiently and effectively address real image dehazing challenges.

Authors:Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu
Title: Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Abstract:
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at https://github.com/yunlong10/CAT-V
Chinese: CAT-V是一个无需训练的细粒度对象中心视频描述框架,通过整合分割、时序分析和描述生成模块,利用时空视觉提示生成详细且时间敏感的对象描述,支持多种用户交互方式。
English: CAT-V is a training-free framework for fine-grained object-centric video captioning that integrates segmentation, temporal analysis, and description generation to produce detailed, temporally-aware object descriptions through flexible user prompts.

Authors:P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin
Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Abstract:
Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.
Chinese Summary: 本研究通过全自动LLM标注流程构建了大规模中文偏好数据集COIG-P,有效解决了现有数据集规模小、领域窄的问题,实验证明其能显著提升模型性能并训练出高效的中文奖励模型。
English Summary: The study introduces COIG-P, a large-scale Chinese preference dataset created using an automated LLM-based pipeline to overcome limitations of existing datasets, and demonstrates its effectiveness through improved model performance and a robust reward model.

Authors:Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty
Title: ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
Abstract:
Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types, including infographics and dashboards, and featuring 1,948 questions in various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.
中文: 针对现有图表问答基准的不足,ChartQAPro引入了包含1341个多样化图表和1948个复杂问题的新数据集,揭示了大型视觉语言模型性能显著下降,并指出了提升图表推理能力的关键挑战。
English: To address the limitations of existing chart question answering benchmarks, ChartQAPro introduces a diverse dataset of 1,341 charts and 1,948 complex questions, revealing significant performance drops in large vision-language models and highlighting key challenges for advancing chart reasoning capabilities.

Authors:Marija Ivanovska, Leon Todorov, Naser Damer, Deepak Kumar Jain, Peter Peer, Vitomir Štruc
Title: SelfMAD: Enhancing Generalization and Robustness in Morphing Attack Detection via Self-Supervised Learning
Abstract:
With the continuous advancement of generative models, face morphing attacks have become a significant challenge for existing face verification systems due to their potential use in identity fraud and other malicious activities. Contemporary Morphing Attack Detection (MAD) approaches frequently rely on supervised, discriminative models trained on examples of bona fide and morphed images. These models typically perform well with morphs generated with techniques seen during training, but often lead to sub-optimal performance when subjected to novel unseen morphing techniques. While unsupervised models have been shown to perform better in terms of generalizability, they typically result in higher error rates, as they struggle to effectively capture features of subtle artifacts. To address these shortcomings, we present SelfMAD, a novel self-supervised approach that simulates general morphing attack artifacts, allowing classifiers to learn generic and robust decision boundaries without overfitting to the specific artifacts induced by particular face morphing methods. Through extensive experiments on widely used datasets, we demonstrate that SelfMAD significantly outperforms current state-of-the-art MADs, reducing the detection error by more than 64% in terms of EER when compared to the strongest unsupervised competitor, and by more than 66%, when compared to the best performing discriminative MAD model, tested in cross-morph settings. The source code for SelfMAD is available at https://github.com/LeonTodorov/SelfMAD.
Chinese: 本文提出SelfMAD,一种自监督方法,通过模拟通用的形态攻击伪影来提升对新型人脸形态攻击的检测能力,在跨形态测试中显著优于现有模型,将检测错误率降低了64%以上。
English: The paper introduces SelfMAD, a self-supervised method that simulates general morphing attack artifacts to enhance the detection of novel face morphing techniques, significantly outperforming existing models by reducing detection errors by over 64% in cross-morph settings.

Authors:Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Title: Few-shot Personalized Scanpath Prediction
Abstract:
A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction (FS-PSP) task and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: https://github.com/cvlab-stonybrook/few-shot-scanpath
中文: 本文提出了一种基于主体嵌入网络(SE-Net)的小样本个性化扫描路径预测方法,通过生成个体化表征,仅需少量数据即可为新主体实现精准的扫描路径预测,且无需测试时微调。
English: This paper introduces a few-shot personalized scanpath prediction (FS-PSP) method using a Subject-Embedding Network (SE-Net) to generate individualized representations, enabling accurate scanpath predictions for new subjects with minimal data and no test-time fine-tuning.

Authors:Arnas Uselis, Seong Joon Oh
Title: Intermediate Layer Classifiers for OOD generalization
Abstract:
Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce \textit{Intermediate Layer Classifiers} (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization
中文: 中间层分类器通常比倒数第二层提供显著更好的分布外泛化能力,因为它们对多种数据集和架构中的数据分布变化较不敏感。
English: Intermediate layer classifiers often provide significantly better out-of-distribution generalization than penultimate layers, as they are less sensitive to data distribution shifts across various datasets and architectures.

Authors:Ziad Kheil, Soleakhena Ken, Laurent Risser
Title: Biomechanical Constraints Assimilation in Deep-Learning Image Registration: Application to sliding and locally rigid deformations
Abstract:
Regularization strategies in medical image registration often take a one-size-fits-all approach by imposing uniform constraints across the entire image domain. Yet biological structures are anything but regular. Lacking structural awareness, these strategies may fail to consider a panoply of spatially inhomogeneous deformation properties, which would faithfully account for the biomechanics of soft and hard tissues, especially in poorly contrasted structures. To bridge this gap, we propose a learning-based image registration approach in which the inferred deformation properties can locally adapt themselves to trained biomechanical characteristics. Specifically, we first enforce in the training process local rigid displacements, shearing motions or pseudo-elastic deformations using regularization losses inspired by the field of solid mechanics. We then show on synthetic and real 3D thoracic and abdominal images that these mechanical properties of different nature are well generalized when inferring the deformations between new image pairs. Our approach enables neural networks to infer tissue-specific deformation patterns directly from input images, ensuring mechanically plausible motion. These networks preserve rigidity within hard tissues while allowing controlled sliding in regions where tissues naturally separate, more faithfully capturing physiological motion. The code is publicly available at https://github.com/Kheil-Z/biomechanical_DLIR.
中文摘要:现有医学图像配准方法采用统一约束而忽略组织间生物力学差异,为此我们提出一种基于学习的方法,能局部自适应变形属性以更真实地模拟组织运动特性。
English Summary: Current medical image registration methods apply uniform constraints that overlook the biomechanical differences between tissues, so we developed a learning-based approach that adapts deformation properties locally to better mimic real tissue behavior.

Authors:Yue Yao, Mohamed-Khalil Bouzidi, Daniel Goehring, Joerg Reichardt
Title: EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations
Abstract:
As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints are available at: https://github.com/continental/EP-Diffuser.
Chinese: EP-Diffuser是一种参数高效的扩散模型,能够生成多样且合理的交通场景预测,尽管模型规模较小,却在准确性和鲁棒性上超越了现有最优模型。
English: EP-Diffuser is a parameter-efficient diffusion model that generates diverse and plausible traffic scene predictions, achieving superior accuracy and robustness compared to state-of-the-art models despite its smaller size.

Authors:Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet
Title: Time-adaptive Video Frame Interpolation based on Residual Diffusion
Abstract:
In this work, we propose a new diffusion-based method for video frame interpolation (VFI), in the context of traditional hand-made animation. We introduce three main contributions: The first is that we explicitly handle the interpolation time in our model, which we also re-estimate during the training process, to cope with the particularly large variations observed in the animation domain, compared to natural videos; The second is that we adapt and generalize a diffusion scheme called ResShift recently proposed in the super-resolution community to VFI, which allows us to perform a very low number of diffusion steps (in the order of 10) to produce our estimates; The third is that we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
Chinese: 本研究提出了一种针对手绘动画的扩散式视频帧插值新方法,具备时间感知插值、高效低步数扩散方案及像素级不确定性估计三大创新,显著提升了动画视频的插值性能。
English: This study introduces a novel diffusion-based video frame interpolation method tailored for hand-drawn animation, featuring time-aware interpolation, an efficient diffusion scheme requiring minimal steps, and pixel-level uncertainty estimation to enhance reliability.

Authors:Xueqiao Zhang, Chao Zhang, Jianwen Sun, Jun Xiao, Yi Yang, Yawei Luo
Title: EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design
Abstract:
Large Language Models (LLMs) have significantly advanced smart education in the Artificial General Intelligence (AGI) era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: (1) Customized Generation: generating niche-targeted teaching content based on students' varying learning abilities and states, and (2) Intelligent Optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multi-agent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. Additionally, we introduce the CIDDP, an LLM-based five-dimensional evaluation module encompassing Clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework. Our code is publicly available at https://github.com/Zc0812/Edu_Planner
中文: EduPlanner是一个基于大语言模型的多智能体系统,通过技能树结构和五维评估模块为数学课程定制并优化教学设计,实验验证了其有效性。
English: EduPlanner is a multi-agent LLM system that customizes and optimizes instructional designs using a Skill-Tree structure and a five-dimensional evaluation module, demonstrated effectively in mathematics education.

Authors:Junghun Oh, Sungyong Baik, Kyoung Mu Lee
Title: Find A Winning Sign: Sign Is All We Need to Win the Lottery
Abstract:
The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
Chinese: 该研究表明,在随机初始化的网络中保留参数符号配置和稀疏性,通过线性连通性维持低误差障碍,能够使其达到与完全训练的稀疏网络相媲美的性能。
English: The study reveals that preserving parameter sign configurations and sparsity in randomly initialized networks enables them to achieve performance comparable to fully trained sparse networks by maintaining low error barriers through linear connectivity.
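The idea of a randomly initialized network "inheriting" sparsity and sign information can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that pruned weights are stored as exact zeros and that the fresh network keeps its own random magnitudes; the function name `inherit_sign_and_mask` is hypothetical and the authors' actual procedure may differ.

```python
import math

def inherit_sign_and_mask(init_weights, trained_sparse_weights):
    """Combine random magnitudes with the mask and signs of a trained sparse net.

    Assumption: a weight equal to 0.0 in `trained_sparse_weights` is pruned.
    """
    if len(init_weights) != len(trained_sparse_weights):
        raise ValueError("weight lists must have the same length")
    result = []
    for w_init, w_sparse in zip(init_weights, trained_sparse_weights):
        if w_sparse == 0.0:
            # Pruned position: inherit the sparsity mask.
            result.append(0.0)
        else:
            # Kept position: random magnitude, sign of the trained weight.
            result.append(math.copysign(abs(w_init), w_sparse))
    return result

random_init = [0.3, -0.2, 0.5, -0.7]
trained_sparse = [-1.2, 0.0, 0.4, 0.9]
inherited = inherit_sign_and_mask(random_init, trained_sparse)
print(inherited)
```

The resulting vector would then be trained further; the sketch only covers the initialization step.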

Authors:Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xu-Yao Zhang
Title: Achieving binary weight and activation for LLMs using Post-Training Quantization
Abstract:
Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.
中文摘要:本文提出了一种新颖的训练后量化框架,通过细粒度分组和误差减少技术,在1位权重和激活下实现卓越性能,推动了全二值化大语言模型的发展。
English Summary: This paper introduces a novel post-training quantization framework that achieves superior performance with 1-bit weights and activations through fine-grained grouping and error-reduction techniques, advancing toward fully binarized LLMs.
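The claim that INT4 activations can be decomposed "equivalently" into 4 x INT1 can be checked with a small worked example. The sketch below assumes an unsigned 4-bit encoding and a plain bit-plane split (least significant bit first); the paper's exact format and scaling factors may differ, and the helper names are hypothetical.

```python
def to_bit_planes(values, bits=4):
    """Split unsigned integers into `bits` binary planes, least significant first."""
    for v in values:
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} does not fit in {bits} unsigned bits")
    return [[(v >> k) & 1 for v in values] for k in range(bits)]

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def dot_via_planes(weights, values, bits=4):
    """Compute weights . values as a weighted sum of dot products with binary planes."""
    planes = to_bit_planes(values, bits)
    return sum((1 << k) * dot(weights, plane) for k, plane in enumerate(planes))

activations = [0, 5, 15, 9]   # INT4 values in [0, 15]
weights = [1, -1, 1, -1]      # binary (+1/-1) weights
planes = to_bit_planes(activations)
direct = dot(weights, activations)
via_planes = dot_via_planes(weights, activations)
print(direct, via_planes)
```

Because x = sum over k of 2^k * b_k, the two results agree exactly; the fourfold increase in channels corresponds to the four binary planes.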

Authors:Hao Nan Sheng, Zhi-yong Wang, Mingrui Yang, Hing Cheung So
Title: AROMA: Autonomous Rank-one Matrix Adaptation
Abstract:
As large language models continue to grow in size, parameter-efficient fine-tuning (PEFT) has become increasingly crucial. While low-rank adaptation (LoRA) offers a solution through low-rank updates, its static rank allocation may yield suboptimal results. Adaptive low-rank adaptation (AdaLoRA) improves this with dynamic allocation but remains sensitive to initial and target rank configurations. We introduce AROMA, a framework that automatically constructs layer-specific updates by iteratively building up rank-one components with very few trainable parameters that gradually diminish to zero. Unlike existing methods that employ rank reduction mechanisms, AROMA introduces a dual-loop architecture for rank growth. The inner loop extracts information from each rank-one subspace, while the outer loop determines the number of rank-one subspaces, i.e., the optimal rank. We reset optimizer states to maintain subspace independence. AROMA significantly reduces parameters compared to LoRA and AdaLoRA while achieving superior performance on natural language understanding and commonsense reasoning tasks, offering new insights into adaptive PEFT. The code is available at https://github.com/ShuDun23/AROMA.
Chinese: AROMA提出了一种双循环框架,通过迭代构建极少可训练参数的秩一组件来自动生成层级特定更新,相比LoRA和AdaLoRA在语言任务中显著提升了参数效率与性能表现。
English: AROMA introduces a dual-loop framework that dynamically constructs layer-specific updates by iteratively building rank-one components with minimal trainable parameters, significantly outperforming LoRA and AdaLoRA in efficiency and performance on language tasks.

Authors:Changyu Du, Zihan Deng, Stavros Nousias, André Borrmann
Title: Predictive Modeling: BIM Command Recommendation Based on Large-scale Usage Logs
Abstract:
The adoption of Building Information Modeling (BIM) and model-based design within the Architecture, Engineering, and Construction (AEC) industry has been hindered by the perception that using BIM authoring tools demands more effort than conventional 2D drafting. To enhance design efficiency, this paper proposes a BIM command recommendation framework that predicts the optimal next actions in real-time based on users' historical interactions. We propose a comprehensive filtering and enhancement method for large-scale raw BIM log data and introduce a novel command recommendation model. Our model builds upon the state-of-the-art Transformer backbones originally developed for large language models (LLMs), incorporating a custom feature fusion module, dedicated loss function, and targeted learning strategy. In a case study, the proposed method is applied to over 32 billion rows of real-world log data collected globally from the BIM authoring software Vectorworks. Experimental results demonstrate that our method can learn universal and generalizable modeling patterns from anonymous user interaction sequences across different countries, disciplines, and projects. When generating recommendations for the next command, our approach achieves a Recall@10 of approximately 84%. The code is available at: https://github.com/dcy0577/BIM-Command-Recommendation.git
中文: 本文提出了一种BIM命令推荐框架,通过基于Transformer的模型分析大规模用户交互数据来预测最优操作指令,在实际测试中Recall@10达到84%,有效提升了设计效率。
English: This paper introduces a BIM command recommendation framework that enhances design efficiency by predicting optimal next actions using a Transformer-based model trained on large-scale user interaction data, achieving 84% Recall@10 in real-world testing.
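For readers unfamiliar with the metric, Recall@10 for next-command recommendation is the share of cases in which the command the user actually issued appears among the top 10 recommendations. A minimal sketch follows; the command names are made up for illustration and do not come from the Vectorworks logs.

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of cases whose true next command is within the top-k recommendations."""
    if len(ranked_lists) != len(targets):
        raise ValueError("need exactly one target per ranked list")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)

# Hypothetical recommendations (best first) and the commands users actually chose.
recommendations = [
    ["Line", "Wall", "Move"],
    ["Rotate", "Mirror", "Move"],
    ["Wall", "Door", "Window"],
]
actual_next = ["Wall", "Move", "Roof"]

recall_top2 = recall_at_k(recommendations, actual_next, k=2)
recall_top3 = recall_at_k(recommendations, actual_next, k=3)
print(recall_top2, recall_top3)
```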

Authors:Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
Title: URECA: Unique Region Caption Anything
Abstract:
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.
Chinese: URECA数据集和模型通过多粒度方法解决了现有区域级描述方法的局限性,利用精细化的数据筛选流程和增强的空间编码技术确保独特且一致的描述,实现了最先进的性能表现。
English: The URECA dataset and model address the limitations of existing region-level captioning methods by introducing a multi-granularity approach that ensures unique and consistent captions through a refined data curation pipeline and enhanced spatial encoding techniques, achieving state-of-the-art performance.

Authors:Hansheng Chen, Kai Zhang, Hao Tan, Zexiang Xu, Fujun Luan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Title: Gaussian Mixture Flow Matching Models
Abstract:
Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256$\times$256.
中文摘要:提出的GMFlow模型通过预测高斯混合参数来更精确地捕捉速度分布,改进了扩散模型和流匹配方法,其新型概率引导方案实现了更优的少步采样效果并有效缓解了色彩过饱和问题。
English Summary: The proposed GMFlow model improves upon diffusion and flow matching methods by predicting Gaussian mixture parameters for more accurate velocity distributions, enabling superior few-step sampling and reducing color over-saturation through a novel probabilistic guidance scheme.

Authors:Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, John Langford
Title: Dion: Distributed Orthonormalized Updates
Abstract:
Orthonormalized updates accelerate training, improve stability, and enable robust hyperparameter transfer, but existing methods like Muon rely on dense matrix operations that clash with sharded weights in large-scale LLM training, causing high compute and communication cost. We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule that replaces Newton-Schulz iteration with amortized power iteration on a momentum buffer, avoiding full-matrix reconstruction and integrating cleanly with weight sharding. The rank-fraction parameter with error feedback enables low-rank updates that balance quality with significant cost savings. On language models from 160M to 3B parameters, Dion retains the benefits of orthonormalized updates, while markedly reducing wall-clock time at scale, making it a practical optimizer for next-generation foundation models. Code is available at: https://github.com/microsoft/dion/
中文: Dion提出了一种可扩展的分布式正交化方法,通过基于动量的迭代替代密集矩阵运算,在保持更新质量的同时显著提升了大规模语言模型训练的效率和稳定性。
English: Dion introduces a scalable distributed orthonormalization method that replaces dense matrix operations with efficient momentum-based iterations, enabling faster and more stable training of large language models while maintaining update quality.
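To make "amortized power iteration on a momentum buffer" with error feedback more concrete, here is a rank-1 sketch in plain Python. It is one reading of the abstract, not the authors' implementation: the update order, the decay factor `mu`, and all function names are assumptions, and the real method works on sharded matrices at higher rank.

```python
import math

def matvec(m, v):
    """Matrix-vector product m @ v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mat_t_vec(m, v):
    """Transposed product m.T @ v."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def rank1_orthonormal_step(momentum, grad, q, mu=0.95):
    """One power-iteration step on (momentum + grad); returns update, new momentum, new q."""
    buf = [[m + g for m, g in zip(m_row, g_row)]
           for m_row, g_row in zip(momentum, grad)]
    p = matvec(buf, q)                      # left direction from the previous q
    p_norm = norm(p) or 1.0
    p = [x / p_norm for x in p]
    r = mat_t_vec(buf, p)                   # right direction, unnormalized
    # Error feedback (assumed form): only part of the captured component is removed,
    # so whatever the rank-1 step missed stays in the buffer.
    new_momentum = [[buf[i][j] - (1.0 - mu) * p[i] * r[j] for j in range(len(r))]
                    for i in range(len(p))]
    r_norm = norm(r) or 1.0
    new_q = [x / r_norm for x in r]         # reused next step: the amortized part
    update = [[p_i * q_j for q_j in new_q] for p_i in p]
    return update, new_momentum, new_q

zeros = [[0.0, 0.0], [0.0, 0.0]]
gradient = [[2.0, 0.0], [0.0, 0.0]]
update, new_momentum, new_q = rank1_orthonormal_step(zeros, gradient, [1.0, 0.0])
print(update, new_momentum, new_q)
```

The update is an outer product of two unit vectors, so its scale does not depend on the gradient magnitude, which is the property orthonormalized updates rely on.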

Authors:Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
Title: Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic
Abstract:
Large language models (LLMs) achieve impressive results on advanced mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising the question of whether they have truly grasped fundamental arithmetic rules or are merely relying on pattern matching. To unravel this issue, we systematically probe LLMs' understanding of two-integer addition ($0$ to $2^{64}$) by testing three crucial properties: commutativity ($A+B=B+A$), representation invariance via symbolic remapping (e.g., $7 \mapsto Y$), and consistent accuracy scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark disconnect: while models achieve high numeric accuracy (73.8-99.8%), they systematically fail these diagnostics. Specifically, accuracy plummets to $\le 7.5$% with symbolic inputs, commutativity is violated in up to 20% of cases, and accuracy scaling is non-monotonic. Interventions further expose this pattern-matching reliance: explicitly providing rules degrades performance by 29.49%, while prompting for explanations before answering merely maintains baseline accuracy. These findings demonstrate that current LLMs address elementary addition via pattern matching, not robust rule induction, motivating new diagnostic benchmarks and innovations in model architecture and training to cultivate genuine mathematical reasoning. Our dataset and generating code are available at https://github.com/kuri-leo/llm-arithmetic-diagnostic.
中文: 大语言模型在基础算术上数值准确率高,但在交换律和符号不变性等基本属性诊断中表现不佳,表明其依赖模式匹配而非真正的规则理解。
English: Large language models achieve high numeric accuracy in basic arithmetic but fail diagnostic tests for fundamental properties like commutativity and symbolic invariance, revealing reliance on pattern matching rather than genuine rule understanding.
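Two of the diagnostics described above, commutativity and representation invariance under symbolic remapping, can be sketched as probe generators. In the sketch, `answer_fn` is a hypothetical stand-in for a model call, the symbol alphabet is arbitrary, and the prompt format is an assumption rather than the benchmark's actual template.

```python
import random

DIGITS = "0123456789"

def make_symbol_map(symbols="ABCDEFGHIJ", seed=0):
    """Build a one-to-one mapping from digits to arbitrary symbols."""
    rng = random.Random(seed)
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return dict(zip(DIGITS, shuffled))

def remap(number, mapping):
    """Rewrite a non-negative integer using the symbol alphabet."""
    return "".join(mapping[d] for d in str(number))

def unmap(text, mapping):
    """Invert `remap`, recovering the integer."""
    inverse = {symbol: digit for digit, symbol in mapping.items()}
    return int("".join(inverse[c] for c in text))

def commutative_pair(a, b):
    """Return the two prompts whose answers should agree."""
    return f"{a}+{b}=", f"{b}+{a}="

def is_commutative(answer_fn, a, b):
    first, second = commutative_pair(a, b)
    return answer_fn(first) == answer_fn(second)

def exact_adder(prompt):
    """Reference solver used here in place of a model."""
    left, right = prompt.rstrip("=").split("+")
    return str(int(left) + int(right))

mapping = make_symbol_map()
a, b = 123, 2 ** 63
symbolic_prompt = f"{remap(a, mapping)}+{remap(b, mapping)}="
print(symbolic_prompt, is_commutative(exact_adder, a, b))
```

A model that has learned the addition rule should answer the symbolic prompt as reliably as the numeric one once the mapping is given; the reported drop to at most 7.5% is measured on probes of this kind.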

Authors:Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
Title: PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity
Abstract:
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets. The code is available at https://github.com/BurakGurbuz97/PEAKS.
中文: 本研究提出了增量数据选择(IDS)问题,并开发了PEAKS方法,通过结合特征空间几何关系与预测误差,在动态数据流场景中持续优于现有选择策略。
English: The study introduces the Incremental Data Selection (IDS) problem and proposes PEAKS, an efficient method that leverages geometric relationships and prediction errors to outperform existing strategies in dynamic data streaming contexts.

Authors:Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed
Title: A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
Abstract:
Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.
Chinese: 医学影像中的视觉-语言预训练因数据稀缺和概念精细而面临挑战,因此回归监督式单模态预训练,证明其更具竞争力且能更好地整合异构数据。
English: Vision-language pre-training in medical imaging faces challenges due to scarce datasets and fine-grained concepts, prompting a return to supervised unimodal pre-training that proves more competitive and better integrates heterogeneous data.

Authors:Jiaming Chen, Wentao Zhao, Ziyu Meng, Donghui Mao, Ran Song, Wei Pan, Wei Zhang
Title: Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation
Abstract:
Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
中文: 研究者提出了视觉语言模型预测控制(VLMPC)及其优化版本Traj-VLMPC,通过结合视觉语言模型的感知能力与模型预测控制框架,利用视频预测或轨迹生成来规划机器人操作动作,在公开测试和实际任务中均超越了现有最优方法。
English: The authors introduce Vision-Language Model Predictive Control (VLMPC) and its enhanced variant Traj-VLMPC, which integrate vision-language models with model predictive control to enable robotic manipulation planning by generating and evaluating action sequences through video or trajectory prediction, outperforming existing methods on benchmarks and real-world tasks.

Authors:Rayan Merghani Ahmed, Adnan Iltaf, Mohamed Elmanna, Gang Zhao, Hongliang Li, Yue Du, Bin Li, Shoujun Zhou
Title: MSA-UNet3+: Multi-Scale Attention UNet3+ with New Supervised Prototypical Contrastive Loss for Coronary DSA Image Segmentation
Abstract:
Accurate segmentation of coronary Digital Subtraction Angiography images is essential to diagnose and treat coronary artery diseases. Despite advances in deep learning, challenges such as high intra-class variance and class imbalance limit precise vessel delineation. Most existing approaches for coronary DSA segmentation cannot address these issues. Also, existing segmentation networks' encoders do not directly generate semantic embeddings, which could enable the decoder to reconstruct segmentation masks effectively from these well-defined features. We propose a Supervised Prototypical Contrastive Loss that fuses supervised and prototypical contrastive learning to enhance coronary DSA image segmentation. The supervised contrastive loss enforces semantic embeddings in the encoder, improving feature differentiation. The prototypical contrastive loss allows the model to focus on the foreground class while alleviating the high intra-class variance and class imbalance problems by concentrating only on the hard-to-classify background samples. We implement the proposed SPCL loss within an MSA-UNet3+: a Multi-Scale Attention-Enhanced UNet3+ architecture. The architecture integrates key components: a Multi-Scale Attention Encoder and a Multi-Scale Dilated Bottleneck designed to enhance multi-scale feature extraction and a Contextual Attention Fusion Module built to keep fine-grained details while improving contextual understanding. Experiments on a private coronary DSA dataset show that MSA-UNet3+ outperforms state-of-the-art methods, achieving the highest Dice coefficient and F1-score and significantly reducing ASD and ACD. The developed framework provides clinicians with precise vessel segmentation, enabling accurate identification of coronary stenosis and supporting informed diagnostic and therapeutic decisions. The code will be released at https://github.com/rayanmerghani/MSA-UNet3plus.
中文: 本研究提出了一种监督原型对比损失方法,结合MSA-UNet3+架构,通过解决类别不平衡和提升特征区分度来改进冠状动脉DSA图像分割,在私有数据集上实现了最优性能。
English: This study introduces a Supervised Prototypical Contrastive Loss integrated into an MSA-UNet3+ architecture to enhance coronary DSA image segmentation by addressing class imbalance and improving feature differentiation, achieving superior performance on a private dataset.

Authors:Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
Title: Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval
Abstract:
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
中文: 提出的直接文档相关性优化(DDRO)方法通过成对排序将标记级文档标识符生成与文档级相关性对齐,在显著提升检索效果的同时简化了优化流程,优于基于强化学习的方法。
English: The proposed direct document relevance optimization (DDRO) method aligns token-level document identifier generation with document-level relevance through pairwise ranking, outperforming reinforcement learning-based approaches with significant improvements in retrieval effectiveness while simplifying the optimization process.
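Aligning token-level docid generation with document-level relevance through a pairwise objective can be sketched as a logistic ranking loss over sequence log-likelihoods. The form below is a simplified assumption: it omits any reference model or regularization the actual DDRO objective may include, and `beta` and the function names are hypothetical.

```python
import math

def sequence_log_prob(token_log_probs):
    """Log-likelihood of a docid as the sum of its token log-probabilities."""
    return sum(token_log_probs)

def pairwise_ranking_loss(pos_token_log_probs, neg_token_log_probs, beta=1.0):
    """-log sigmoid(beta * (log p(relevant docid) - log p(non-relevant docid)))."""
    margin = beta * (sequence_log_prob(pos_token_log_probs)
                     - sequence_log_prob(neg_token_log_probs))
    # Numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical per-token log-probabilities for two docids of three tokens each.
relevant = [-0.1, -0.2, -0.1]
non_relevant = [-0.9, -1.1, -0.7]

loss_correct_order = pairwise_ranking_loss(relevant, non_relevant)
loss_wrong_order = pairwise_ranking_loss(non_relevant, relevant)
loss_tie = pairwise_ranking_loss(relevant, relevant)
print(loss_correct_order, loss_wrong_order, loss_tie)
```

The loss depends only on the whole-sequence likelihood gap between the two docids, so no separate reward model is needed to train on it.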

Authors:Guangqiang Li, M. Amine Atoui, Xiangshun Li
Title: Attention-Based Multiscale Temporal Fusion Network for Uncertain-Mode Fault Diagnosis in Multimode Processes
Abstract:
Fault diagnosis in multimode processes plays a critical role in ensuring the safe operation of industrial systems across multiple modes. It faces a major challenge that has yet to be addressed: the significant distributional differences among monitoring data from multiple modes make it difficult for models to extract shared feature representations related to system health conditions. In response to this problem, this paper introduces a novel method called attention-based multiscale temporal fusion network. The multiscale depthwise convolution and gated recurrent unit are employed to extract multiscale contextual local features and long-short-term features. Instance normalization is applied to suppress mode-specific information. Furthermore, a temporal attention mechanism is designed to focus on critical time points with higher cross-mode shared information, thereby enhancing the accuracy of fault diagnosis. The proposed model is applied to the Tennessee Eastman process dataset and the three-phase flow facility dataset. The experiments demonstrate that the proposed model achieves superior diagnostic performance and maintains a small model size. The source code will be available on GitHub at https://github.com/GuangqiangLi/AMTFNet.
中文摘要:本文提出了一种基于注意力的多尺度时序融合网络,通过多尺度特征提取和时间注意力机制解决多模式过程中的故障诊断难题,在保持模型轻量化的同时实现了优越的诊断性能。
English Summary: This paper introduces an attention-based multiscale temporal fusion network that addresses multimode process fault diagnosis by extracting shared features through multiscale analysis and temporal attention, achieving superior performance with compact model size.
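The role of instance normalization in suppressing mode-specific information can be illustrated on a single channel. The sketch below assumes that operating modes differ mainly in offset and scale, which is a simplification; the sample values are invented, not taken from the Tennessee Eastman data.

```python
import math

def instance_norm(series, eps=1e-5):
    """Normalize one sample's channel over time to zero mean and unit variance."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    return [(x - mean) / math.sqrt(var + eps) for x in series]

# The same temporal pattern observed in two hypothetical operating modes.
mode_a = [1.0, 2.0, 3.0, 4.0]
mode_b = [10.0, 20.0, 30.0, 40.0]

norm_a = instance_norm(mode_a)
norm_b = instance_norm(mode_b)
max_gap = max(abs(x - y) for x, y in zip(norm_a, norm_b))
print(norm_a, norm_b, max_gap)
```

After normalization the two windows are nearly identical, so later layers see the shared temporal shape rather than the mode-dependent level.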

Authors:Xingyu Hu, Junjun Jiang, Chenyang Wang, Kui Jiang, Xianming Liu, Jiayi Ma
Title: Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
Abstract:
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model's generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named "TITA", which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks. The source codes are released at https://github.com/huxingyuabc/TITA.
中文: 提出的“TITA”框架通过交互增强像素注意力和操作自适应融合模块,动态平衡任务不变交互与任务特定适应,在保持多任务性能的同时显著提升了对未知融合任务的泛化能力。
English: The proposed "TITA" framework dynamically balances task-invariant interaction and task-specific adaptation through specialized modules to enhance unified image fusion performance and generalization across both known and unseen tasks.

Authors:Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Title: CARE: Multilingual Human Preference Learning for Cultural Awareness
Abstract:
Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce CARE, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with human judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs, outperforming larger generic preference data. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment, leading to gaps among models developed in different regions with varying access to culturally relevant data. CARE is publicly available at https://github.com/Guochry/CARE.
中文摘要:通过引入本土文化偏好优化语言模型的偏好调整,能比使用更大规模通用数据更有效地提升其文化感知能力,CARE资源验证了这一点。
English Summary: Preference tuning for language models is enhanced by incorporating native cultural preferences, improving their cultural awareness more effectively than using larger generic datasets, as demonstrated by the CARE resource.

Authors:Suhang Gu, Ye Wang, Yongxin Chou, Jinliang Cong, Mingli Lu, Zhuqing Jiao
Title: Interpretable Style Takagi-Sugeno-Kang Fuzzy Clustering
Abstract:
Clustering is an efficient and essential technique for exploring latent knowledge of data. However, limited attention has been given to the interpretability of the clusters detected by most clustering algorithms. In addition, due to the homogeneity of data, different groups of data have their own homogeneous styles. In this paper, the above two aspects are considered, and an interpretable style Takagi-Sugeno-Kang (TSK) fuzzy clustering (IS-TSK-FC) algorithm is proposed. The clustering behavior of IS-TSK-FC is fully guided by the TSK fuzzy inference on fuzzy rules. In particular, samples are grouped into clusters represented by the corresponding consequent vectors of all fuzzy rules learned in an unsupervised manner. This can explain how the clusters are generated in detail, thus making the underlying decision-making process of the IS-TSK-FC interpretable. Moreover, a series of style matrices are introduced to facilitate the consequents of fuzzy rules in IS-TSK-FC by capturing the styles of clusters as well as the nuances between different styles. Consequently, all the fuzzy rules in IS-TSK-FC have powerful data representation capability. After determining the antecedents of all the fuzzy rules, the optimization problem of IS-TSK-FC can be iteratively solved in an alternating manner. The effectiveness of IS-TSK-FC as an interpretable clustering tool is validated through extensive experiments on benchmark datasets with unknown implicit/explicit styles. In particular, the superior clustering performance of IS-TSK-FC is demonstrated on case studies where different groups of data present explicit styles. The source code of IS-TSK-FC can be downloaded from https://github.com/gusuhang10/IS-TSK-FC.
中文: 本文提出了一种可解释的风格TSK模糊聚类算法(IS-TSK-FC),通过模糊规则和风格矩阵提升聚类可解释性,并在基准数据集上验证了其有效性。
English: This paper introduces an interpretable style TSK fuzzy clustering algorithm (IS-TSK-FC) that enhances cluster interpretability through fuzzy rules and style matrices, validated by experiments on benchmark datasets.

Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Title: State Tuning: State-based Test-Time Scaling on RWKV-7
Abstract:
Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.
中文: 本文提出状态调优这一新型测试时扩展方法,针对RWKV-7模型通过动态扩展和优化状态矩阵(无需修改预训练权重)实现性能突破,其三大核心创新——观察者框架、核函数状态扩展与去相关反向传播——使小模型在特定任务上超越大模型表现。
English: This paper introduces state tuning, a novel test-time scaling method for the RWKV-7 model that enhances performance by dynamically upscaling and optimizing the state matrix without altering pre-trained weights, achieving state-of-the-art results through three key innovations: an observer framework, kernel-based state upscaling, and Decorrelated Backpropagation.

Authors:Chandra Raskoti, Iftekharul Islam, Xuan Wang, Weizi Li
Title: MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction
Abstract:
Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments where human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors -- such as acceleration, deceleration, and left and right maneuvers -- pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.
Chinese: MIAT架构通过结合机动意图感知与时空交互建模,有效提升车辆轨迹预测精度,在真实数据集上实现短期预测性能最高提升4.7%、长期预测提升1.6%的突破。
English: The MIAT architecture enhances vehicle trajectory prediction by integrating maneuver intention awareness with spatiotemporal modeling, achieving up to 4.7% and 1.6% improvements in short- and long-horizon predictions respectively on real-world datasets.

Authors:Wang Tang, Fethiye Irmak Dogan, Linbo Qing, Hatice Gunes
Title: AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
Abstract:
Dyadic social relationships, which refer to relationships between two individuals who know each other through repeated interactions (or not), are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: (1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance, (2) the disruption of continuous interactions by discrete frame sampling, which segments the temporal continuity of interaction in real-world scenarios, and (3) the limited consideration of periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: (i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); (ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and (iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modeling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: https://github.com/tw-repository/AsyReC.
中文: AsyReC框架通过多模态图神经网络和时序编码器,解决了二元关系建模中的三大挑战——关系不对称性、时间连续性中断和周期性行为整合,在公开数据集上实现了领先性能。
English: The AsyReC framework addresses three key challenges in dyadic relationship modeling—asymmetry, temporal continuity disruption, and periodic behavior integration—through a multimodal graph-based approach with triplet GNNs and temporal encoders, achieving state-of-the-art performance on public datasets.
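
The periodic temporal encoder mentioned in the AsyReC abstract projects time indices onto sine/cosine waveforms. The following is a minimal sketch of that general idea only, not the authors' implementation: the function name `periodic_time_encoding`, the `dim` and `max_period` parameters, and the geometric frequency schedule (borrowed from Transformer positional encodings) are all assumptions made for illustration.

```python
import math


def periodic_time_encoding(t, dim=8, max_period=10000.0):
    """Map a time index t to `dim` sine/cosine features (dim must be even).

    Hypothetical helper: the frequency schedule is an assumption, not
    something stated in the abstract.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for i in range(dim // 2):
        # One frequency per sine/cosine pair, decreasing geometrically.
        freq = 1.0 / (max_period ** (2 * i / dim))
        features.append(math.sin(t * freq))
        features.append(math.cos(t * freq))
    return features
```

Because every feature is a sine or cosine of the time index, time steps that are a whole period apart at a given frequency receive the same value in that pair, which is what lets a downstream model pick up recurrent patterns.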

Authors:Changchuan Yang, Yuhang Dong, Guanzhong Tian, Haizhou Ge, Hongrui Zhu
Title: Wavelet Policy: Imitation Policy Learning in the Scale Domain with Wavelet Transforms
Abstract:
Recent imitation learning policies, often framed as time series prediction tasks, directly map robotic observations, such as high-dimensional visual data and proprioception, into the action space. When deploying at the edge, we found that the underutilization of frequency domain analysis in robotic manipulation trajectory prediction leads to neglecting the inherent rhythm information embedded within action sequences, resulting in errors at critical moments. To address this, we reframe imitation learning policies through the lens of the time-scale domain and introduce the Wavelet Policy. This novel approach employs wavelet transforms (WT) and a new Features Extractor (FE) for feature preprocessing and extracts multi-scale features using the Single Encoder to Multiple Decoder (SE2MD) architecture. Furthermore, to enhance feature mapping in the scale domain and appropriately increase model capacity, we introduce a Learnable Scale Domain Filter (LSDF) after each decoder, improving adaptability under different visual conditions. Our results show that the Wavelet Policy, while maintaining a comparable parameter count, outperforms SOTA end-to-end methods on four challenging simulation robotic arm tasks and real tasks, especially at critical moments and remote settings simultaneously. We release the source code and model checkpoint of the simulation task at https://github.com/lurenjia384/Wavelet_Policy.
中文摘要:提出的Wavelet Policy通过小波变换和多尺度特征提取重构模仿学习策略,能更好地捕捉机器人动作中的节律信息,在仿真和真实任务中均展现出优于现有方法的性能表现。
English Summary: The proposed Wavelet Policy reframes imitation learning using wavelet transforms and multi-scale feature extraction to better capture rhythmic patterns in robotic actions, demonstrating superior performance over state-of-the-art methods in both simulated and real-world tasks.

Authors:Aditya Hemant Shahane, Prathosh A. P, Sandeep Kumar
Title: GOTHAM: Graph Class Incremental Learning Framework under Weak Supervision
Abstract:
Graphs are growing rapidly, along with the number of distinct label categories associated with them. Applications like e-commerce, healthcare, recommendation systems, and various social media platforms are rapidly moving towards graph representation of data due to their ability to capture both structural and attribute information. One crucial task in graph analysis is node classification, where unlabeled nodes are categorized into predefined classes. In practice, novel classes appear incrementally sometimes with just a few labels (seen classes) or even without any labels (unseen classes), either because they are new or haven't been explored much. Traditional methods assume abundant labeled data for training, which isn't always feasible. We investigate a broader objective: Graph Class Incremental Learning under Weak Supervision (GCL), addressing this challenge by meta-training on base classes with limited labeled instances. During the incremental streams, novel classes can have few-shot or zero-shot representation. Our proposed framework GOTHAM efficiently accommodates these unlabeled nodes by finding the closest prototype representation, serving as class representatives in the attribute space. For Text-Attributed Graphs (TAGs), our framework additionally incorporates semantic information to enhance the representation. By employing teacher-student knowledge distillation to mitigate forgetting, GOTHAM achieves promising results across various tasks. Experiments on datasets such as Cora-ML, Amazon, and OGBN-Arxiv showcase the effectiveness of our approach in handling evolving graph data under limited supervision. The repository is available at https://github.com/adityashahane10/GOTHAM--Graph-based-Class-Incremental-Learning-Framework-under-Weak-Supervision.
中文: GOTHAM框架通过基于有限标签的基础类进行元训练,采用原型表示和知识蒸馏技术,有效处理少样本或无标签的新类,并结合文本属性图的语义信息,实现了弱监督下的图类增量学习。
English: The GOTHAM framework addresses graph class incremental learning under weak supervision by meta-training on base classes with limited labels, using prototype representation and knowledge distillation to handle novel classes with few or zero labels while incorporating semantic information for text-attributed graphs.
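
The GOTHAM abstract describes assigning unlabeled nodes to the closest prototype in the attribute space. A minimal sketch of nearest-prototype classification follows. It assumes prototypes are per-class feature means and that closeness is Euclidean distance; neither choice is confirmed by the abstract, and the helper names `class_prototypes` and `nearest_prototype` are hypothetical.

```python
import math


def class_prototypes(features, labels):
    """Average the feature vectors of each class to get one prototype per class.

    Assumption: a prototype is the class mean in attribute space.
    """
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}


def nearest_prototype(vec, prototypes):
    """Return the class whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda c: math.dist(vec, prototypes[c]))
```

In a few-shot setting the prototypes would be built from the handful of labeled nodes per class, and every remaining node would be routed through `nearest_prototype`.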

Authors:Linwei Zhai, Han Ding, Cui Zhao, fei wang, Ge Wang, Wang Zhi, Wei Xi
Title: L3AC: Towards a Lightweight and Lossless Audio Codec
Abstract:
Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and provide discrete tokens for generative modeling. However, leading approaches often rely on resource-intensive models and complex multi-quantizer architectures, limiting their practicality in real-world applications. In this work, we introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. To enhance reconstruction fidelity while minimizing model complexity, L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv--a novel structure designed to capture acoustic variations across multiple temporal scales. Despite its compact design, extensive experiments across diverse datasets demonstrate that L3AC matches or exceeds the reconstruction quality of leading codecs while reducing computational overhead by an order of magnitude. The single-quantizer design further enhances its adaptability for downstream tasks. The source code is publicly available at https://github.com/zhai-lw/L3AC.
Chinese: L3AC是一种轻量级神经音频编解码器,采用单一量化器和高效架构,在显著降低计算开销的同时,其重建质量达到甚至超越了主流编解码器。
English: L3AC is a lightweight neural audio codec that uses a single quantizer and efficient architecture to match or surpass leading codecs in reconstruction quality while significantly reducing computational costs.

Authors:Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang
Title: Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Abstract:
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.
中文:Collab-RAG通过让小型语言模型分解复杂问题、大型语言模型提供反馈,有效提升了多跳问答性能,无需昂贵蒸馏即可实现卓越表现。
English: Collab-RAG enhances multi-hop question-answering by enabling a small language model to decompose complex queries and a large language model to provide feedback, achieving superior performance without expensive distillation.

Authors:Georg Ahnert, Elena Wurth, Markus Strohmaier, Jutta Mata
Title: Simulating Persuasive Dialogues on Meat Reduction with Generative Agents
Abstract:
Meat reduction benefits human and planetary health, but social norms keep meat central in shared meals. To date, the development of communication strategies that promote meat reduction while minimizing social costs has required the costly involvement of human participants at each stage of the process. We present work in progress on simulating multi-round dialogues on meat reduction between Generative Agents based on large language models (LLMs). We measure our main outcome using established psychological questionnaires based on the Theory of Planned Behavior and additionally investigate Social Costs. We find evidence that our preliminary simulations produce outcomes that are (i) consistent with theoretical expectations; and (ii) valid when compared to data from previous studies with human participants. Generative agent-based models are a promising tool for identifying novel communication strategies on meat reduction, tailored to highly specific participant groups, to then be tested in subsequent studies with human participants.
中文: 本研究利用基于大语言模型的生成智能体模拟关于减少肉类消费的对话,初步结果显示其与理论预期及前人研究数据一致,为开发针对性沟通策略提供了高效方法。
English: This research explores using Generative Agents based on large language models to simulate dialogues on meat reduction, showing promising results that align with theoretical expectations and previous human studies, offering a cost-effective method to develop tailored communication strategies.

Authors:Yizhou Dang, Yuting Liu, Enneng Yang, Minhan Huang, Guibing Guo, Jianzhe Zhao, Xingwei Wang
Title: Data Augmentation as Free Lunch: Exploring the Test-Time Augmentation for Sequential Recommendation
Abstract:
Data augmentation has become a promising method of mitigating data sparsity in sequential recommendation. Existing methods generate new yet effective data during model training to improve performance. However, deploying them requires retraining, architecture modification, or introducing additional learnable parameters. The above steps are time-consuming and costly for well-trained models, especially when the model scale becomes large. In this work, we explore test-time augmentation (TTA) for sequential recommendation, which augments the inputs during model inference and then aggregates the model's predictions for augmented data to improve final accuracy. It avoids significant time and cost overhead from loss calculation and backward propagation. We first experimentally disclose the potential of existing augmentation operators for TTA and find that the Mask and Substitute operators consistently achieve better performance. Further analysis reveals that these two operators are effective because they retain the original sequential pattern while adding appropriate perturbations. Meanwhile, we argue that these two operators still face time-consuming item selection or interference information from mask tokens. Based on the analysis and limitations, we present TNoise and TMask. The former injects uniform noise into the original representation, avoiding the computational overhead of item selection. The latter blocks mask tokens from participating in model calculations or directly removes interactions that should have been replaced with mask tokens. Comprehensive experiments demonstrate the effectiveness, efficiency, and generalizability of our method. We provide an anonymous implementation at https://github.com/KingGugu/TTA4SR.
中文: 本文提出针对序列推荐的测试时增强方法,在模型推理阶段通过增强输入数据提升精度而无需重新训练,并开发了TNoise和TMask两种高效方案,在保持序列模式的同时显著优化了计算性能。
English: This paper introduces test-time augmentation (TTA) for sequential recommendation to enhance model accuracy during inference without retraining, proposing efficient methods TNoise and TMask that outperform existing operators by balancing pattern retention with computational efficiency.
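
The abstract above describes test-time augmentation as perturbing inputs at inference and aggregating the predictions, with TNoise injecting uniform noise into the representation. The sketch below illustrates that pattern under stated assumptions: `tta_scores`, `score_fn`, `num_views`, and `noise_scale` are hypothetical names, the aggregation is a plain mean, and the point at which noise is injected is a guess rather than the authors' design.

```python
import random


def tta_scores(score_fn, representation, num_views=5, noise_scale=0.1, seed=0):
    """Average item scores over noisy copies of one sequence representation.

    score_fn is a stand-in for a trained recommender's scoring head: it maps
    a representation (list of floats) to a list of per-item scores.
    """
    rng = random.Random(seed)
    total = None
    for _ in range(num_views):
        # Uniform noise in [-noise_scale, noise_scale] on every dimension.
        noisy = [x + rng.uniform(-noise_scale, noise_scale) for x in representation]
        scores = score_fn(noisy)
        total = scores if total is None else [a + b for a, b in zip(total, scores)]
    return [s / num_views for s in total]
```

No gradients or parameter updates are involved, which is the source of the cost advantage the abstract highlights: the only overhead is `num_views` forward passes.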

Authors:Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Hongliang Li
Title: Unsupervised Ego- and Exo-centric Dense Procedural Activity Captioning via Gaze Consensus Adaptation
Abstract:
Even from an early age, humans naturally adapt between exocentric (Exo) and egocentric (Ego) perspectives to understand daily procedural activities. Inspired by this cognitive ability, we propose a novel Unsupervised Ego-Exo Dense Procedural Activity Captioning (UE$^{2}$DPAC) task, which aims to transfer knowledge from the labeled source view to predict the time segments and descriptions of action sequences for the target view without annotations. Although previous works have addressed fully-supervised single-view or cross-view dense video captioning, they fall short on the proposed task due to the significant inter-view gap caused by temporal misalignment and irrelevant object interference. Hence, we propose a Gaze Consensus-guided Ego-Exo Adaptation Network (GCEAN) that injects the gaze information into the learned representations for the fine-grained Ego-Exo alignment. Specifically, we propose a Score-based Adversarial Learning Module (SALM) that incorporates a discriminative scoring network and compares the scores of distinct views to learn unified view-invariant representations from a global level. Then, the Gaze Consensus Construction Module (GCCM) utilizes the gaze to progressively calibrate the learned representations to highlight the regions of interest and extract the corresponding temporal contexts. Moreover, we adopt hierarchical gaze-guided consistency losses to construct gaze consensus for the explicit temporal and spatial adaptation between the source and target views. To support our research, we propose a new EgoMe-UE$^{2}$DPAC benchmark, and extensive experiments demonstrate the effectiveness of our method, which outperforms many related methods by a large margin. Code is available at https://github.com/ZhaofengSHI/GCEAN.
中文摘要:受人类自然切换外中心与自我中心视角的认知能力启发,本研究提出一种无监督方法,通过注入凝视信息实现双视角对齐,无需标注即可生成密集视频描述。
English Summary: Humans naturally switch between exocentric and egocentric perspectives, inspiring a novel unsupervised method that uses gaze information to align these views for dense video captioning without annotations.

Authors:Pengju Sun, Banglei Guan, Zhenbao Yu, Yang Shang, Qifeng Yu, Daniel Barath
Title: Learning Affine Correspondences by Integrating Geometric Constraints
Abstract:
Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. The experiments show that the accuracy and robustness of our method outperform the existing ones in image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The code is available at https://github.com/stilcrad/DenseAffine.
Chinese: 本文提出了一种新颖的流程,通过结合密集匹配和几何约束来改进仿射对应关系的提取,在图像匹配和姿态估计任务中相比现有方法展现出更高的准确性和鲁棒性。
English: This paper introduces a novel pipeline that enhances affine correspondence extraction by combining dense matching with geometric constraints, achieving superior accuracy and robustness in image matching and pose estimation tasks compared to existing methods.

Authors:Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou
Title: Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Abstract:
Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
中文: 近期推理语言模型的进展面临高推理成本,本研究系统评估了量化的影响,发现W8A8或W4A16量化可实现无损压缩,但更低比特位宽会带来精度风险,模型规模和任务难度是关键影响因素。
English: Recent advances in reasoning language models face high inference costs, and this study systematically evaluates quantization's impact, finding that while W8A8 or W4A16 can achieve lossless compression, lower bit-widths risk accuracy, with model size and task difficulty being key factors.
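
In the notation used above, W8A8 means 8-bit weights with 8-bit activations, and W4A16 means 4-bit weights with 16-bit activations. To make the effect of bit-width concrete, here is a minimal sketch of symmetric round-to-nearest quantization of a list of values. This is the simplest possible baseline and is not one of the state-of-the-art algorithms the study evaluates; the function names are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization; returns (integer codes, scale).

    Codes lie in [-qmax, qmax] with qmax = 2**(bits - 1) - 1, and a single
    scale is shared by all values (per-tensor scaling, an assumption here).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of any value is at most half the scale, and the scale grows as the bit-width shrinks (qmax is 127 at 8 bits but only 7 at 4 bits), which gives some intuition for why lower bit-widths carry more accuracy risk.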

Authors:Timo Brand, Daniel Faber, Stephan Held, Petra Mutzel
Title: A Customized SAT-based Solver for Graph Coloring
Abstract:
We introduce ZykovColor, a novel SAT-based algorithm to solve the graph coloring problem working on top of an encoding that mimics the Zykov tree. Our method is based on an approach of Hébrard and Katsirelos (2020) that employs a propagator to enforce transitivity constraints, incorporate lower bounds for search tree pruning, and enable inferred propagations. We leverage the recently introduced IPASIR-UP interface for CaDiCaL to implement these techniques with a SAT solver. Furthermore, we propose new features that take advantage of the underlying SAT solver. These include modifying the integrated decision strategy with vertex domination hints and using incremental bottom-up search that allows reusing learned clauses from previous calls. Additionally, we integrate a more efficient clique computation to improve the lower bounds during the search. We validate the effectiveness of each new feature through an experimental analysis. ZykovColor outperforms other state-of-the-art graph coloring implementations on the DIMACS benchmark set. Further experiments on random Erdős-Rényi graphs show that our new approach dominates state-of-the-art SAT-based methods for both very sparse and highly dense graphs.
中文: ZykovColor是一种基于SAT的新型图着色算法,通过顶点支配提示和增量搜索等特性提升效率,在基准测试中优于现有最优方法。
English: ZykovColor is a novel SAT-based algorithm for graph coloring that enhances efficiency through features like vertex domination hints and incremental search, outperforming state-of-the-art methods on benchmarks.
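
The Zykov tree that the ZykovColor encoding mimics comes from a classic recurrence: for any two non-adjacent vertices u and v, either they receive different colors (add the edge uv) or the same color (contract v into u), and the chromatic number is the smaller result of the two branches. The sketch below implements only that plain recurrence. It is not the SAT-based algorithm from the paper, has no propagator, bounds, or clause learning, and runs in exponential time, so it is suitable for tiny graphs only. The function name is hypothetical.

```python
def chromatic_number(vertices, edges):
    """Chromatic number via the Zykov recurrence (exponential; tiny graphs only)."""
    vertices = frozenset(vertices)
    edges = frozenset(frozenset(e) for e in edges)

    def solve(vs, es):
        ordered = sorted(vs)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                if frozenset((u, v)) not in es:
                    # Branch 1: u and v get different colors, so add edge uv.
                    different = solve(vs, es | {frozenset((u, v))})
                    # Branch 2: u and v share a color, so contract v into u.
                    merged = frozenset(
                        frozenset(u if x == v else x for x in e) for e in es
                    )
                    same = solve(vs - {v}, merged)
                    return min(different, same)
        # No non-adjacent pair is left: the graph is complete.
        return len(vs)

    return solve(vertices, edges)
```

Each leaf of this recursion is a complete graph whose size is a valid number of colors, which is why lower bounds such as clique sizes are useful for pruning branches in practical solvers.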

Authors:Tengjun Jin, Yuxuan Zhu, Daniel Kang
Title: ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
Abstract:
Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ELT-Bench.
中文摘要:ELT-Bench作为端到端基准测试平台被提出,用于评估AI代理构建复杂ELT管道的能力,实验结果显示当前最优代理仅能正确生成3.9%的数据模型,揭示了自动化数据工程工作流仍面临重大挑战。
English Summary: ELT-Bench is introduced as an end-to-end benchmark to evaluate AI agents' capabilities in building complex ELT pipelines, with experimental results showing current agents can correctly generate only 3.9% of data models, highlighting significant challenges in automating data engineering workflows.

Authors:Tianyang Wu, Lipeng Wan, Yuhang Wang, Qiang Wan, Xuguang Lan
Title: Playing Non-Embedded Card-Based Games with Reinforcement Learning
Abstract:
Significant progress has been made in AI for games, including board games, MOBA, and RTS games. However, complex agents are typically developed in an embedded manner, directly accessing game state information, unlike human players who rely on noisy visual data, leading to unfair competition. Developing complex non-embedded agents remains challenging, especially in card-based RTS games with complex features and large state spaces. We propose a non-embedded offline reinforcement learning training strategy using visual inputs to achieve real-time autonomous gameplay in the RTS game Clash Royale. Due to the lack of an object detection dataset for this game, we designed an efficient generative object detection dataset for training. We extract features using state-of-the-art object detection and optical character recognition models. Our method enables real-time image acquisition, perception feature fusion, decision-making, and control on mobile devices, successfully defeating built-in AI opponents. All code is open-sourced at https://github.com/wty-yy/katacr.
中文摘要:本研究提出了一种基于视觉输入的非嵌入式强化学习方法,通过创建生成式目标检测数据集解决了《皇室战争》中目标检测数据缺失的难题,实现了移动端实时自主游戏并成功击败内置AI对手。
English Summary: This study introduces a non-embedded reinforcement learning approach using visual inputs to enable real-time autonomous gameplay in Clash Royale, overcoming challenges like the absence of object detection datasets by creating a generative dataset and achieving victory against built-in AI opponents.

Authors:Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Title: Continuous Locomotive Crowd Behavior Generation
Abstract:
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at https://github.com/InhwanBae/CrowdES.
中文: 本文提出了一种新颖框架,通过交替使用人群发射器和模拟器来生成具有异质行为的连续逼真人群轨迹,该方法在不同地理环境中展现出优异泛化能力且支持全流程用户控制。
English: This paper introduces a novel framework for generating continuous, realistic crowd trajectories with heterogeneous behaviors using an alternating emitter-simulator approach, which demonstrates superior generalization across environments and offers full user control.

Authors:Xiongbo Lu, Yaxiong Chen, Shengwu Xiong
Title: AnyArtisticGlyph: Multilingual Controllable Artistic Glyph Generation
Abstract:
Artistic Glyph Image Generation (AGIG) differs from current creativity-focused generation models by offering finely controllable deterministic generation. It transfers the style of a reference image to a source while preserving its content. Although advanced and promising, current methods may reveal flaws when scrutinizing synthesized image details, often producing blurred or incorrect textures, posing a significant challenge. Hence, we introduce AnyArtisticGlyph, a diffusion-based, multilingual controllable artistic glyph generation model. It includes a font fusion and embedding module, which generates latent features for detailed structure creation, and a vision-text fusion and embedding module that uses the CLIP model to encode references and blends them with transformation caption embeddings for seamless global image generation. Moreover, we incorporate a coarse-grained feature-level loss to enhance generation accuracy. Experiments show that it produces natural, detailed artistic glyph images with state-of-the-art performance. Our project will be open-sourced on https://github.com/jiean001/AnyArtisticGlyph to advance text generation technology.
Chinese: AnyArtisticGlyph是一种基于扩散模型的多语言可控艺术字形生成方法,通过融合字体和视觉文本模块,能够生成细节丰富、自然逼真的图像,并达到顶尖性能水平。
English: AnyArtisticGlyph is a diffusion-based model that enables multilingual controllable artistic glyph generation by integrating font and vision-text fusion modules, producing high-quality, detailed images with state-of-the-art performance.

Authors:Samarth Mishra, Kate Saenko, Venkatesh Saligrama
Title: Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Abstract:
Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state of the art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs "cat chasing dog". While on Winoground, a benchmark for measuring such reasoning, MLLMs have made significant progress, they are still far from a human's performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities which we can see through significant improvements across multiple vision language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, SCRAMBLe tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
中文: 多模态大语言模型在组合推理方面存在困难,但通过SCRAMBLe方法,利用合成偏好数据进行训练,可显著提升其在Winoground等基准测试和通用视觉问答任务中的表现。
English: Multimodal large language models struggle with compositional reasoning, but the SCRAMBLe method enhances this by training them on synthetic preference data, leading to significant improvements on benchmarks like Winoground and general visual question answering tasks.
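
The abstract describes training a model to prefer the correct caption over a close but incorrect one, without spelling out the objective. One common formulation of such binary preference learning is the DPO loss; the sketch below assumes that formulation for illustration and is not confirmed to be the exact SCRAMBLe objective. The `beta` value and the log-probability inputs are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one (chosen, rejected) caption pair.

    Inputs are summed token log-probabilities of each caption under the
    model being tuned and under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); for very negative
    # margins fall back to the asymptote -margin to avoid overflow in exp.
    if margin > -30:
        return math.log1p(math.exp(-margin))
    return -margin
```

The loss equals log 2 when the tuned model and the reference agree, and it drops below that value as the tuned model shifts probability toward the correct caption.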

Authors:Samarth Mishra, Kate Saenko, Venkatesh Saligrama
Title: SCRAMBLe : Enhancing Multimodal LLM Compositionality with Synthetic Preference Data
Abstract:
Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state of the art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs "cat chasing dog". While on Winoground, a benchmark for measuring such reasoning, MLLMs have made significant progress, they are still far from a human's performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities which we can see through significant improvements across multiple vision language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, SCRAMBLe tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
中文: 多模态大语言模型在组合推理方面存在困难,但通过SCRAMBLe方法,利用合成偏好数据进行训练,可显著提升其在Winoground等基准测试和通用视觉问答任务中的表现。
English: Multimodal large language models struggle with compositional reasoning, but the SCRAMBLe method enhances this by training them on synthetic preference data, leading to significant improvements on benchmarks like Winoground and general visual question answering tasks.

Authors:Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: Inverse++: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection
Abstract:
3D semantic occupancy prediction aims to forecast detailed geometric and semantic information of the surrounding environment for autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate inner structure module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by introducing an additional 3D supervision signal through a 3D object detection auxiliary branch. This extra 3D supervision signal enhances the model's overall performance by strengthening the capability of the intermediate features to capture small dynamic objects in the scene. These small dynamic objects often include vulnerable road users (VRUs), e.g., bicycles, motorcycles, and pedestrians, whose detection is crucial for ensuring driving safety in autonomous vehicles. Extensive experiments conducted on the nuScenes datasets, including challenging rainy and nighttime scenarios, show that our approach attains state-of-the-art results, achieving an IoU score of 31.73% and an mIoU score of 20.91%, and excels at detecting VRUs. The code will be made available at: https://github.com/DanielMing123/Inverse++
中文: 本文通过引入3D物体检测辅助分支来增强自动驾驶车辆的3D语义占据预测,提升了中间特征对小尺寸动态物体(如易受伤害道路使用者)的捕捉能力,并在nuScenes数据集上取得了最优性能。
English: This paper enhances 3D semantic occupancy prediction for autonomous vehicles by incorporating a 3D object detection auxiliary branch, which improves feature representation for detecting small dynamic objects like vulnerable road users and achieves state-of-the-art performance on nuScenes datasets.

Authors:Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo, Xingwei Wang
Title: Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials
Abstract:
Hard negative samples can accelerate model convergence and optimize decision boundaries, which is key to improving the performance of recommender systems. Although large language models (LLMs) possess strong semantic understanding and generation capabilities, systematic research has not yet been conducted on how to generate hard negative samples effectively. To fill this gap, this paper introduces the concept of Semantic Negative Sampling and explores how to optimize LLMs for high-quality, hard negative sampling. Specifically, we design an experimental pipeline that includes three main modules, profile generation, semantic negative sampling, and semantic alignment, to verify the potential of LLM-driven hard negative sampling in enhancing the accuracy of collaborative filtering (CF). Experimental results indicate that hard negative samples generated based on LLMs, when semantically aligned and integrated into CF, can significantly improve CF performance, although there is still a certain gap compared to traditional negative sampling methods. Further analysis reveals that this gap primarily arises from two major challenges: noisy samples and lack of behavioral constraints. To address these challenges, we propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals. Experimental results show that this framework outperforms traditional negative sampling and other LLM-driven recommendation methods across multiple datasets, providing new solutions for empowering traditional RS with LLMs. Additionally, we validate the excellent generalization ability of the LLM-based semantic negative sampling method on new datasets, demonstrating its potential in alleviating issues such as data sparsity, popularity bias, and the problem of false hard negative samples. Our implementation code is available at https://github.com/user683/HNLMRec.
中文: 本文提出利用大语言模型进行语义负采样,生成硬负样本来提升协同过滤性能,并通过HNLMRec框架解决噪声样本和行为约束缺失等挑战,该方法优于传统负采样技术,并能缓解数据稀疏性、流行度偏差等问题。
English: This paper introduces Semantic Negative Sampling using large language models to generate hard negative samples that enhance collaborative filtering performance, proposing the HNLMRec framework to overcome challenges like noisy samples and lack of behavioral constraints, which outperforms traditional methods and addresses issues like data sparsity and popularity bias.

Authors:Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, Rema Padman
Title: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Abstract:
Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.
中文摘要:本综述系统回顾了大语言模型多轮交互评估与增强的最新进展,涵盖多领域任务场景下的基准测试、优化方法及未来研究方向。
English Summary: This survey comprehensively reviews recent progress in evaluating and improving multi-turn interactions in large language models, covering benchmarks, enhancement methods, and future research directions across various task-specific scenarios.

Authors:Haoren Zhao, Tianyi Chen, Zhen Wang
Title: On the Robustness of GUI Grounding Models Against Image Attacks
Abstract:
Graphical User Interface (GUI) grounding models are crucial for enabling intelligent agents to understand and interact with complex visual interfaces. However, these models face significant robustness challenges in real-world scenarios due to natural noise and adversarial perturbations, and their robustness remains underexplored. In this study, we systematically evaluate the robustness of state-of-the-art GUI grounding models, such as UGround, under three conditions: natural noise, untargeted adversarial attacks, and targeted adversarial attacks. Our experiments, which were conducted across a wide range of GUI environments, including mobile, desktop, and web interfaces, have clearly demonstrated that GUI grounding models exhibit a high degree of sensitivity to adversarial perturbations and low-resolution conditions. These findings provide valuable insights into the vulnerabilities of GUI grounding models and establish a strong benchmark for future research aimed at enhancing their robustness in practical applications. Our code is available at https://github.com/ZZZhr-1/Robust_GUI_Grounding.
Chinese: 本研究系统评估了先进GUI接地模型的鲁棒性,发现它们在多种界面环境下对对抗性扰动和低分辨率条件高度敏感。
English: This study systematically assesses the robustness of state-of-the-art GUI grounding models, revealing their high sensitivity to adversarial perturbations and low-resolution conditions across various interfaces.

Authors:Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
Title: Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
Abstract:
The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at https://github.com/sunblaze-ucb/llm-api-audit
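中文摘要:商业大语言模型API存在信任问题,服务商可能暗中替换为更廉价的模型;仅依赖文本输出的检测方法在面对隐蔽或自适应攻击时效果有限,对数概率分析虽能提供更强保证但往往不可用,论文进而讨论了以可信执行环境实现可证明模型完整性的潜力及其权衡。
English Summary: Commercial LLM APIs face a trust issue where providers may secretly substitute cheaper models; detection methods that rely solely on text outputs are limited against subtle or adaptive attacks, log probability analysis offers stronger guarantees but is often unavailable, and Trusted Execution Environments are discussed as a path toward provable model integrity with trade-offs in security, performance, and provider adoption.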

Authors:Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
Title: Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
Abstract:
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. Providers may covertly substitute cheaper alternatives (e.g., quantized versions, smaller models) to reduce costs while maintaining advertised pricing. We formalize this model substitution problem and systematically evaluate detection methods under realistic adversarial conditions. Our empirical analysis reveals that software-only methods are fundamentally unreliable: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while methods using log probabilities are defeated by inherent inference nondeterminism in production environments. We argue that this verification gap can be more effectively closed with hardware-level security. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution. Our findings demonstrate that TEEs can provide provable cryptographic guarantees of model integrity with only a modest performance overhead, offering a clear and actionable path to ensure users get what they pay for. Code is available at https://github.com/sunblaze-ucb/llm-api-audit
中文摘要:商业大语言模型API存在信任问题,服务商可能暗中替换廉价模型,软件检测方法效果不佳,而基于可信执行环境的硬件级安全方案能以较小性能开销提供可靠解决方案。
English Summary: Commercial LLM APIs face a trust issue where providers may secretly substitute cheaper models, and software-only detection methods prove unreliable, but hardware-level security using Trusted Execution Environments offers a practical, robust solution with only a modest performance overhead.

Authors:Xiaolun Jing, Genke Yang, Jian Chu
Title: TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Abstract:
Motivated by the success of coarse-grained or fine-grained contrast in text-video retrieval, there emerge multi-grained contrastive learning methods which focus on the integration of contrasts with different granularity. However, due to the wider semantic range of videos, the text-agnostic video representations might encode misleading information not described in texts, thus impeding the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language-video attention block to generate aggregated frame and video representations conditioned on the word's and text's attention weights over frames. To filter unnecessary similarity interactions and decrease trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module to identify attentive similarities and reorganize cross-modal similarity vectors and matrices. Next, we argue that the imbalance problem among multigrained similarities may result in over- and under-representation issues. We thereby introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss to facilitate cooperative relationship utilization by similarity variance minimization on matching text-video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage the interactions between multiple similarities and promote the usage of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks, outperforming X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), +1.5% (+0.9%) relative (absolute) improvements in text-to-video retrieval R@1 on MSR-VTT, DiDeMo and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
中文摘要:本文提出的文本条件多粒度对比(TC-MGC)框架通过生成文本条件化的视频特征,结合相似度重组和去相关正则化方法,有效解决了视频检索中文本无关信息干扰的问题,在多个基准测试中取得了优越性能。
English Summary: The proposed Text-Conditioned Multi-Grained Contrast (TC-MGC) framework addresses misleading video representations in text-video retrieval by generating text-conditioned video features and implementing similarity reorganization with decorrelation regularization, achieving competitive performance across multiple benchmarks.

Authors:Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Abstract:
The growing size of the Key-Value (KV) cache during long-context inference of Large Language Models is the main obstacle to balancing deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leverage attention weights to evict non-critical cache tokens. However, these methods come with a trade-off: they usually require major modifications to the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV entries themselves. It is a fully attention-free method that offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
中文摘要:LagKV是一种创新的KV缓存压缩方法,通过直接比较KV条目而无需注意力权重,实现了便捷集成和在长上下文大语言模型推理中的卓越性能。
English Summary: LagKV is a novel KV cache compression method that eliminates the need for attention weights by comparing KV entries directly, offering easy integration and superior performance in long-context LLM inference.

Authors:Bo-Wen Yin, Jiao-Long Cao, Ming-Ming Cheng, Qibin Hou
Title: DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
Abstract:
Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP-RGBD/DFormer.
中文摘要:DFormerv2提出了一种创新的RGBD编码器,将深度图作为几何先验来引导自注意力机制中的权重分配,无需对深度信息进行神经网络显式编码,即在RGBD语义分割任务中实现了卓越性能。
English Summary: DFormerv2 introduces a novel RGBD encoder that treats depth maps as geometry priors to guide the allocation of attention weights in self-attention, achieving exceptional performance in RGBD semantic segmentation without explicitly encoding depth information with neural networks.

Authors:Wanzhou Liu, Zhexiao Xiong, Xinyu Li, Nathan Jacobs
Title: DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal
Abstract:
Recent novel view synthesis (NVS) techniques, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly advanced 3D scene reconstruction with high-quality rendering and realistic detail recovery. Effectively removing occlusions while preserving scene details can further enhance the robustness and applicability of these techniques. However, existing approaches for object and occlusion removal predominantly rely on generative priors, which, despite filling the resulting holes, introduce new artifacts and blurriness. Moreover, existing benchmark datasets for evaluating occlusion removal methods lack realistic complexity and viewpoint variations. To address these issues, we introduce DeclutterSet, a novel dataset featuring diverse scenes with pronounced occlusions distributed across foreground, midground, and background, exhibiting substantial relative motion across viewpoints. We further introduce DeclutterNeRF, an occlusion removal method free from generative priors. DeclutterNeRF introduces joint multi-view optimization of learnable camera parameters, occlusion annealing regularization, and employs an explainable stochastic structural similarity loss, ensuring high-quality, artifact-free reconstructions from incomplete images. Experiments demonstrate that DeclutterNeRF significantly outperforms state-of-the-art methods on our proposed DeclutterSet, establishing a strong baseline for future research.
中文:近期的新视角合成技术如NeRF和3DGS显著提升了三维场景重建质量,但现有遮挡去除方法依赖生成先验导致伪影,且缺乏真实复杂数据集;为此提出的DeclutterSet数据集和DeclutterNeRF方法通过多视角联合优化实现了无伪影的高质量重建。
English: Recent advances in novel view synthesis techniques like NeRF and 3DGS have improved 3D scene reconstruction, but existing occlusion removal methods often introduce artifacts and lack realistic evaluation datasets, prompting the introduction of DeclutterSet and DeclutterNeRF for artifact-free, high-quality reconstructions.

Authors:Tomasz Kacprzak, Francois Kamper, Michael W. Heiss, Gianluca Janka, Ann M. Dillner, Satoshi Takahama
Title: Scalable Approximate Algorithms for Optimal Transport Linear Models
Abstract:
Recently, linear regression models incorporating an optimal transport (OT) loss have been explored for applications such as supervised unmixing of spectra, music transcription, and mass spectrometry. However, these task-specific approaches often do not generalize readily to a broader class of linear models. In this work, we propose a novel algorithmic framework for solving a general class of non-negative linear regression models with an entropy-regularized OT datafit term, based on Sinkhorn-like scaling iterations. Our framework accommodates convex penalty functions on the weights (e.g. squared-$\ell_2$ and $\ell_1$ norms), and admits additional convex loss terms between the transported marginal and target distribution (e.g. squared error or total variation). We derive simple multiplicative updates for common penalty and datafit terms. This method is suitable for large-scale problems due to its simplicity of implementation and straightforward parallelization.
中文: 本文提出了一种基于熵正则化最优传输损失的非负线性回归算法框架,通过简单的乘法更新支持多种凸惩罚项和数据拟合项,适用于大规模问题。
English: This paper introduces a scalable algorithmic framework for non-negative linear regression using entropy-regularized optimal transport loss, supporting various convex penalties and datafit terms through simple multiplicative updates.
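
The framework above is built on Sinkhorn-like scaling iterations. As background, the sketch below shows the standard Sinkhorn iteration for entropy-regularized OT between two fixed histograms, which is the building block such methods extend. It is not the paper's algorithm for the regression problem; the function name and the default values of `eps` and `iters` are assumptions chosen for illustration.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, iters=200):
    """Entropy-regularized OT plan between positive histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so the plan's row sums match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
    for row in plan:
        print(row)
```

Each update is a simple multiplicative rescaling, which is why this family of methods is straightforward to implement and parallelize.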

Authors:Avaljot Singh, Yasmin Chandini Sarita, Charith Mendis, Gagandeep Singh
Title: Automated Verification of Soundness of DNN Certifiers
Abstract:
The uninterpretability of Deep Neural Networks (DNNs) hinders their use in safety-critical applications. Abstract Interpretation-based DNN certifiers provide promising avenues for building trust in DNNs. Unsoundness in the mathematical logic of these certifiers can lead to incorrect results. However, current approaches to ensure their soundness rely on manual, expert-driven proofs that are tedious to develop, limiting the speed of developing new certifiers. Automating the verification process is challenging due to the complexity of verifying certifiers for arbitrary DNN architectures and handling diverse abstract analyses. We introduce ProveSound, a novel verification procedure that automates the soundness verification of DNN certifiers for arbitrary DNN architectures. Our core contribution is the novel concept of a symbolic DNN, using which, ProveSound reduces the soundness property, a universal quantification over arbitrary DNNs, to a tractable symbolic representation, enabling verification with standard SMT solvers. By formalizing the syntax and operational semantics of ConstraintFlow, a DSL for specifying certifiers, ProveSound efficiently verifies both existing and new certifiers, handling arbitrary DNN architectures. Our code is available at https://github.com/uiuc-focal-lab/constraintflow.git
Chinese: ProveSound通过引入符号深度神经网络,将DNN验证器的可靠性验证自动化,简化了处理任意网络架构的复杂过程,并利用SMT求解器有效验证现有及新型验证器。
English: ProveSound automates the verification of soundness for DNN certifiers by using symbolic DNNs to simplify the process into a tractable form solvable with SMT solvers, addressing the limitations of manual proofs.

Authors:Motoki Abe, Shinpei Hayashi
Title: ICCheck: A Portable, Language-Agnostic Tool for Synchronizing Code Clones
Abstract:
Inconsistent modifications to code clones can lead to software defects. Many approaches exist to support consistent modifications based on clone detection and/or change pattern extraction. However, no tool currently supports synchronization of code clones across diverse programming languages and development environments. We propose ICCheck, a tool designed to be language-agnostic and portable across various environments. By leveraging an existing language-agnostic clone search technique and limiting the tool's external dependency to an existing Git repository, we developed a tool that can assist in synchronizing code clones in diverse environments. We validated the tool's functionality in multiple open-source repositories, demonstrating its language independence. Furthermore, by supporting the Language Server Protocol, we confirmed that ICCheck can be integrated into multiple development environments with minimal effort. ICCheck is available at https://github.com/salab/iccheck
中文:ICCheck 是一种与编程语言无关的工具,通过利用Git仓库并支持语言服务器协议,可在不同开发环境中实现跨语言代码克隆的同步修改,便于集成使用。
English: ICCheck is a language-agnostic tool that enables synchronized modifications of code clones across different programming languages and development environments by leveraging Git repositories and supporting the Language Server Protocol for easy integration.

Authors:Weikai Lin, Tianrui Ma, Adith Boloor, Yu Feng, Ruofan Xing, Xuan Zhang, Yuhao Zhu
Title: SnapPix: Efficient-Coding--Inspired In-Sensor Compression for Edge Vision
Abstract:
Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix makes three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: https://github.com/horizon-research/SnapPix.
中文: SnapPix是一种传感器-算法协同设计的系统,通过编码曝光在模拟域进行压缩,在保持遥感应用性能的同时显著降低了边缘能耗。
English: SnapPix is a sensor-algorithm co-designed system that employs coded exposure for analog domain compression to significantly reduce edge energy consumption while maintaining performance in remote sensing applications.
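
Coded exposure, as described above, selectively exposes pixels both spatially and temporally. A minimal software sketch of that idea follows: a binary mask decides, per time slot and per pixel, whether light is integrated, and the coded image is the per-pixel sum over exposed slots. The function name, mask, and frame values are hypothetical; the learned exposure pattern and the analog in-sensor circuitry from the paper are not modeled here.

```python
def coded_exposure(frames, mask):
    """Compress T frames (T x H x W) into one H x W coded image.

    mask[t][y][x] is 1 if pixel (y, x) is exposed during time slot t, else 0.
    """
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    return [
        [sum(mask[t][y][x] * frames[t][y][x] for t in range(T)) for x in range(W)]
        for y in range(H)
    ]


if __name__ == "__main__":
    frames = [[[1, 2]], [[10, 20]]]   # two time slots, one row, two pixels
    mask = [[[1, 0]], [[0, 1]]]       # each pixel is exposed in a different slot
    print(coded_exposure(frames, mask))
```

Because each pixel can follow its own exposure schedule, neighboring pixels in the coded image may capture different moments, which is the pixel-level non-uniformity the abstract mentions.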

Authors:Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu
Title: Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Abstract:
Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) have not yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, making their exploration in this specific area still a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and actively updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
中文:近期大语言模型在推理任务中采用基于奖励的优化方法如PPO,在人类对齐任务中采用基于偏好的方法如DPO,但两者分别存在奖励破解和性能不足的问题,因此提出TRPA算法,结合规则与偏好优化,在推理任务中实现稳定且具竞争力的性能。
English: Recent advancements in LLMs utilize reward-based methods like PPO for reasoning and preference-based approaches like DPO for alignment, yet both face challenges such as reward hacking and performance gaps, prompting the proposal of the TRPA algorithm that combines rule-based and preference optimization to achieve competitive reasoning results with stability.

Authors:Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang
Title: SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Abstract:
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
中文: SAM2MOT提出了一种新颖的基于分割的多目标跟踪范式,利用SAM2的分割能力实现零样本泛化和强目标关联,在多个数据集上的实验结果表明其达到了最先进的性能水平。
English: SAM2MOT introduces a novel Tracking by Segmentation paradigm for multi-object tracking, leveraging SAM2's segmentation capabilities to achieve zero-shot generalization and strong object association, with experimental results demonstrating state-of-the-art performance on multiple datasets.
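
The abstract says that tracking boxes are generated directly from segmentation masks. A minimal sketch of that single step, assuming a binary mask given as a list of rows, is shown below. The function name `mask_to_box` and the `(x_min, y_min, x_max, y_max)` box convention are illustrative assumptions, not part of the SAM2MOT codebase.

```python
def mask_to_box(mask):
    """Return the tight (x_min, y_min, x_max, y_max) box around a binary mask.

    `mask` is a list of rows containing 0/1 values. Returns None for an
    empty mask. Hypothetical helper for illustration only.
    """
    # Rows that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Columns that contain at least one foreground pixel.
    width = len(mask[0])
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
example_box = mask_to_box(example_mask)
```

Because the box is derived from the mask rather than from a detector, its accuracy depends on mask quality, which is the trade-off the abstract describes.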

Authors:Jiancheng Pan, Yanxing Liu, Xiao He, Long Peng, Jiahao Li, Yuze Sun, Xiaomeng Huang
Title: Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection
Abstract:
Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This approach facilitates efficient few-shot object detection and introduces an approach to solving the CD-FSOD problem by efficiently searching for the optimal parameter configuration from the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at https://github.com/jaychempan/ETS.
中文: 基于GroundingDINO和LAE-DINO等基础模型,通过图像增强和网格化子域搜索策略显著提升了跨域少样本检测性能,实现了无需重复训练的高效参数优化。
English: Foundation models like GroundingDINO and LAE-DINO achieve superior cross-domain few-shot object detection through image augmentation and grid-based sub-domain search, enabling efficient parameter optimization without extensive retraining.

Authors:Archana Sahu, Plaban Kumar Bhowmick
Title: Directed Graph-alignment Approach for Identification of Gaps in Short Answers
Abstract:
In this paper, we present a method for automatically identifying missing items, known as gaps, in student answers by comparing them against the corresponding model (reference) answers. The gaps can be identified at the word, phrase, or sentence level. The identified gaps are useful in providing feedback to students for formative assessment. The problem of gap identification is modelled as an alignment of a pair of directed graphs representing a student answer and the corresponding model answer for a given question. To validate the proposed approach, we developed a gap-annotated dataset of student answers drawn from three widely known datasets in the short answer grading domain, namely University of North Texas (UNT), SciEntsBank, and Beetle; this dataset is available at: https://github.com/sahuarchana7/gaps-answers-dataset. Evaluation metrics used in traditional machine learning tasks have been adopted to evaluate the task of gap identification. Although the performance of the proposed approach varies across the datasets and the types of answers, the overall performance is promising.
中文: 本文提出了一种通过将学生答案与标准答案进行有向图对齐来自动检测其中缺失内容的方法,该方法在多个数据集上表现出良好的性能,适用于形成性评价的反馈环节。
English: This paper introduces an automated method for detecting gaps in student answers by aligning them with model answers using directed graphs, which proves effective for formative feedback across multiple datasets.

Authors:Shuolong Chen, Xingxing Li, Liu Yuan
Title: eKalibr-Stereo: Continuous-Time Spatiotemporal Calibration for Event-Based Stereo Visual Systems
Abstract:
The bioinspired event camera, distinguished by its exceptional temporal resolution, high dynamic range, and low power consumption, has been extensively studied in recent years for motion estimation, robotic perception, and object detection. In ego-motion estimation, the stereo event camera setup is commonly adopted due to its direct scale perception and depth recovery. For optimal stereo visual fusion, accurate spatiotemporal (extrinsic and temporal) calibration is required. Because few stereo visual calibrators designed for event cameras exist, we build on our previous work eKalibr (an event camera intrinsic calibrator) and propose eKalibr-Stereo for accurate spatiotemporal calibration of event-based stereo visual systems. To improve the continuity of grid pattern tracking, eKalibr-Stereo extends the grid pattern recognition method in eKalibr with an additional motion prior-based tracking module that tracks incomplete grid patterns. Based on tracked grid patterns, a two-step initialization procedure is performed to recover initial guesses of piece-wise B-splines and spatiotemporal parameters, followed by a continuous-time batch bundle adjustment to refine the initialized states to optimal ones. The results of extensive real-world experiments show that eKalibr-Stereo can achieve accurate event-based stereo spatiotemporal calibration. The implementation of eKalibr-Stereo is open-sourced at https://github.com/Unsigned-Long/eKalibr to benefit the research community.
中文: 该研究提出了eKalibr-Stereo方法,用于立体事件相机的精确时空标定,通过改进网格图案追踪和采用两步初始化优化流程实现高精度校准,其实现代码已开源共享。
English: The study introduces eKalibr-Stereo, a method for precise spatiotemporal calibration of stereo event cameras, enhancing grid pattern tracking and achieving accurate results through a two-step initialization and optimization process, with its implementation made publicly available.

Authors:Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Title: UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Abstract:
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.
中文: UniToken通过结合离散与连续视觉表征的统一模型,在视觉理解和图像生成任务中均实现了最先进的性能,经多基准测试验证其卓越表现。
English: UniToken is a unified model that combines discrete and continuous visual representations to achieve state-of-the-art performance in both visual understanding and image generation tasks, as validated by extensive experiments across multiple benchmarks.

Authors:Yiming Shi, Shaoshuai Yang, Xun Zhu, Haoyu Wang, Xiangling Fu, Miao Li, Ji Wu
Title: MedM-VL: What Makes a Good Medical LVLM?
Abstract:
Medical image analysis is essential in modern healthcare. Deep learning has redirected research focus toward complex medical multimodal tasks, including report generation and visual question answering. Traditional task-specific models often fall short in handling these challenges. Large vision-language models (LVLMs) offer new solutions for solving such tasks. In this study, we build on the popular LLaVA framework to systematically explore model architectures and training strategies for both 2D and 3D medical LVLMs. We present extensive empirical findings and practical guidance. To support reproducibility and future research, we release a modular codebase, MedM-VL, and two pre-trained models: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code is available at: https://github.com/MSIIP/MedM-VL
中文摘要:本研究基于LLaVA框架开发了大型视觉语言模型以推进医学图像分析,提供了实证研究成果并发布了包含2D和3D预训练模型的MedM-VL代码库。
English Summary: This study advances medical image analysis by developing large vision-language models based on the LLaVA framework, providing empirical insights and releasing the MedM-VL codebase with pre-trained models for both 2D and 3D applications.

Authors:Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang
Title: CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Abstract:
Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems -- a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.
Chinese: 我们推出了CO-Bench这一包含36个现实组合优化问题的基准测试套件,通过系统评估LLM智能体的表现,揭示了其相对于传统算法的优势与不足,为未来研究指明方向。
English: CO-Bench is introduced as a comprehensive benchmark suite with 36 real-world combinatorial optimization problems to systematically evaluate LLM agents' capabilities, revealing their strengths and limitations compared to traditional algorithms.

Authors:Anjan Bellamkonda, Laksh Bharani, Harivatsan Selvam
Title: AbsInf: A Lightweight Object to Represent float('inf') in Dijkstra's Algorithm
Abstract:
We introduce AbsInf, a lightweight abstract object designed as a high-performance alternative to Python's native float('inf') within pathfinding algorithms. Implemented as a C-based Python extension, AbsInf bypasses IEEE-754 float coercion and dynamic type dispatch, offering constant-time dominance comparisons and arithmetic neutrality. When integrated into Dijkstra's algorithm without altering its logic, AbsInf reduces runtime by up to 17.2%, averaging 9.74% across diverse synthetic and real-world graph datasets. This optimization highlights the performance trade-offs in high-frequency algorithmic constructs, where a symbolic use of infinity permits efficient abstraction. Our findings contribute to the broader discourse on lightweight architectural enhancements for interpreted languages, particularly in performance-critical control flows.
中文: AbsInf是一种基于C语言的轻量级Python扩展,作为float('inf')的高效替代品应用于路径查找算法,通过恒定时间操作和算术中性特性,使Dijkstra算法运行时间最高减少17.2%。
English: AbsInf is a lightweight C-based Python extension that serves as an efficient replacement for float('inf') in pathfinding algorithms, reducing Dijkstra's algorithm runtime by up to 17.2% through constant-time operations and arithmetic neutrality.
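
To illustrate the idea of a symbolic infinity that dominates every comparison and is neutral under addition, here is a minimal pure-Python sketch used inside an otherwise unmodified Dijkstra loop. The class name `SymbolicInf` is hypothetical; the AbsInf of the paper is a C extension, and this stand-in only demonstrates the semantics, not the reported speedup (a pure-Python class is likely slower than `float('inf')`).

```python
import heapq


class SymbolicInf:
    """Hypothetical stand-in for a symbolic infinity sentinel."""

    __slots__ = ()

    def __lt__(self, other):
        # Infinity is never smaller than anything.
        return False

    def __gt__(self, other):
        # Infinity is larger than every value except another infinity.
        return not isinstance(other, SymbolicInf)

    def __add__(self, other):
        # Arithmetic neutrality: inf + x stays inf.
        return self

    __radd__ = __add__

    def __repr__(self):
        return "SymbolicInf()"


INF = SymbolicInf()


def dijkstra(graph, source):
    """Standard Dijkstra over {node: [(neighbor, weight), ...]}."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            candidate = d + w
            # int < SymbolicInf falls back to SymbolicInf.__gt__.
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


example_graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [], "d": []}
example_dist = dijkstra(example_graph, "a")
```

Unreachable nodes (here `"d"`) keep the sentinel as their distance, mirroring how `float('inf')` is normally used.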

Authors:Mete Ahishali, Anis Ur Rahman, Einari Heinaro, Samuli Junttila
Title: ADA-Net: Attention-Guided Domain Adaptation Network with Contrastive Learning for Standing Dead Tree Segmentation Using Aerial Imagery
Abstract:
Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by inferring the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at https://github.com/meteahishali/ADA-Net. The data is publicly available at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-standing-dead-tree-segmentation.
中文摘要:本研究提出ADA-Net这一新型域自适应方法,通过图像转换实现跨林区知识迁移,解决了枯立木航拍图像分割中标注数据匮乏的问题,在芬兰和美国数据集上取得了最优性能。
English Summary: This study introduces ADA-Net, a novel domain adaptation method for segmenting standing dead trees from aerial imagery, addressing data scarcity by transferring knowledge between forest regions and achieving state-of-the-art performance on datasets from Finland and the USA.

Authors:Yuantao Zhang, Zhankui Yang
Title: A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
Abstract:
The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at https://github.com/zyttt-coder/LLM_similarity.
中文摘要:本文提出了一种利用困惑度曲线和Menger曲率的新型指标来量化大语言模型相似度,通过实验验证能有效检测模型复制行为,维护模型原创性与完整性。
English Summary: This paper introduces a novel metric using perplexity curves and Menger curvature to quantify LLM similarity, effectively detecting model replication and preserving originality through validated experiments.
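
For reference, the Menger curvature of three points is the reciprocal of the radius of the circle through them, i.e. four times the triangle area divided by the product of the side lengths. The sketch below computes that standard quantity for 2D points. How the paper samples points from perplexity curves and compares curvatures is not stated in the abstract, so the function name and the 2D point representation are assumptions.

```python
import math


def menger_curvature(p, q, r):
    """Menger curvature of three 2D points: 4 * area / (|pq| * |qr| * |pr|).

    Returns 0.0 for collinear or coincident points.
    """
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # Twice the triangle area via the cross product of two edge vectors.
    twice_area = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    side_product = math.dist(p, q) * math.dist(q, r) * math.dist(p, r)
    if side_product == 0:
        return 0.0
    return 2.0 * twice_area / side_product


# Three points on the unit circle, so the curvature should be close to 1.
unit_circle_curvature = menger_curvature((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0))
# Collinear points have zero curvature.
line_curvature = menger_curvature((0.0, 0.0), (1.0, 1.0), (2.0, 2.0))
```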

Authors:Xinyu Mao, Teerapong Leelanupab, Martin Potthast, Harrisen Scells, Guido Zuccon
Title: AiReview: An Open Platform for Accelerating Systematic Reviews with LLMs
Abstract:
Systematic reviews are fundamental to evidence-based medicine. Creating one is time-consuming and labour-intensive, mainly due to the need to screen, or assess, many studies for inclusion in the review. Several tools have been developed to streamline this process, mostly relying on traditional machine learning methods. Large language models (LLMs) have shown potential in further accelerating the screening process. However, no tool currently allows end users to directly leverage LLMs for screening or facilitates systematic and transparent usage of LLM-assisted screening methods. This paper introduces (i) an extensible framework for applying LLMs to systematic review tasks, particularly title and abstract screening, and (ii) a web-based interface for LLM-assisted screening. Together, these elements form AiReview, a novel platform for LLM-assisted systematic review creation. AiReview is the first of its kind to bridge the gap between cutting-edge LLM-assisted screening methods and those who create medical systematic reviews. The tool is available at https://aireview.ielab.io. The source code is also open-sourced at https://github.com/ielab/ai-review.
Chinese: AiReview是一个创新平台,它通过可扩展的框架和网络界面,首次将大型语言模型应用于系统评价的筛选过程,有效解决了传统方法耗时费力的问题,并公开了源代码供用户使用。
English: AiReview is a pioneering platform that introduces an extensible framework and web interface for using large language models to streamline the labor-intensive process of systematic review screening, bridging the gap between advanced AI methods and medical review creation.

Authors:Bohao Wang, Feng Liu, Jiawei Chen, Xingyu Lou, Changwang Zhang, Jun Wang, Yuegang Sun, Yan Feng, Chun Chen, Can Wang
Title: MSL: Not All Tokens Are What You Need for Tuning LLM as a Recommender
Abstract:
Large language models (LLMs), known for their comprehension capabilities and extensive knowledge, have been increasingly applied to recommendation systems (RS). Given the fundamental gap between the mechanism of LLMs and the requirement of RS, researchers have focused on fine-tuning LLMs with recommendation-specific data to enhance their performance. Language Modeling Loss (LML), originally designed for language generation tasks, is commonly adopted. However, we identify two critical limitations of LML: 1) it exhibits significant divergence from the recommendation objective; 2) it erroneously treats all fictitious item descriptions as negative samples, introducing misleading training signals. To address these limitations, we propose a novel Masked Softmax Loss (MSL) tailored for fine-tuning LLMs on recommendation. MSL improves LML by identifying and masking invalid tokens that could lead to fictitious item descriptions during loss computation. This strategy can effectively avoid the interference from erroneous negative signals and ensure good alignment with the recommendation objective, supported by theoretical guarantees. During implementation, we identify a potential challenge related to gradient vanishing of MSL. To overcome this, we further introduce the temperature coefficient and propose an Adaptive Temperature Strategy (ATS) that adaptively adjusts the temperature without requiring extensive hyperparameter tuning. Extensive experiments conducted on four public datasets further validate the effectiveness of MSL, achieving an average improvement of 42.24% in NDCG@10. The code is available at https://github.com/WANGBohaO-jpg/MSL.
中文: 研究者提出了一种新颖的掩码Softmax损失函数(MSL),通过屏蔽无效标记并采用自适应温度策略,有效解决了语言建模损失在推荐系统微调中的局限性,显著提升了模型性能。
English: Researchers propose a novel Masked Softmax Loss (MSL) to address the limitations of Language Modeling Loss in fine-tuning large language models for recommendation systems, achieving significant performance improvements by masking invalid tokens and incorporating an adaptive temperature strategy.
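
The masking idea can be sketched for a single decoding step: the softmax normalizer is computed only over tokens marked valid, so invalid tokens contribute no negative signal. The exact MSL formulation and the Adaptive Temperature Strategy are not given in the abstract; the function below, its name, its fixed `temperature` argument, and the boolean `valid_mask` are illustrative assumptions.

```python
import math


def masked_softmax_loss(logits, target, valid_mask, temperature=1.0):
    """Cross-entropy for one step, normalized over valid tokens only.

    logits: list of floats, one per vocabulary token.
    target: index of the ground-truth token (must be valid).
    valid_mask: list of bools, True where the token is allowed.
    """
    if not valid_mask[target]:
        raise ValueError("target token must be marked valid")
    # Keep only valid tokens and apply the temperature.
    scaled = [z / temperature for z, ok in zip(logits, valid_mask) if ok]
    # Numerically stable log-sum-exp over the valid tokens.
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_norm - logits[target] / temperature


example_logits = [2.0, 1.0, 0.0]
# Token 2 is treated as invalid and excluded from the normalizer.
masked_loss = masked_softmax_loss(example_logits, 0, [True, True, False])
# With every token valid this reduces to the ordinary softmax loss.
full_loss = masked_softmax_loss(example_logits, 0, [True, True, True])
```

In this toy case the masked loss is smaller than the unmasked one because the excluded token no longer competes with the target in the normalizer.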

Authors:Shiguang Sun, Hanbo Zhang, Zeyang Liu, Xinrui Yang, Lipeng Wan, Xingyu Chen, Xuguang Lan
Title: MInCo: Mitigating Information Conflicts in Distracted Visual Model-based Reinforcement Learning
Abstract:
Existing visual model-based reinforcement learning (MBRL) algorithms with observation reconstruction often suffer from information conflicts, which make it difficult to learn compact representations and hence result in less robust policies, especially in the presence of task-irrelevant visual distractions. In this paper, we first reveal, from an information-theoretic perspective, that the information conflicts in current visual MBRL algorithms stem from visual representation learning and latent dynamics modeling. Based on this finding, we present a new algorithm to resolve information conflicts for visual MBRL, named MInCo, which mitigates information conflicts by leveraging negative-free contrastive learning, aiding in learning invariant representations and robust policies despite noisy observations. To prevent the dominance of visual representation learning, we introduce time-varying reweighting to bias the learning towards dynamics modeling as training proceeds. We evaluate our method on several robotic control tasks with dynamic background distractions. Our experiments demonstrate that MInCo learns invariant representations against background noise and consistently outperforms current state-of-the-art visual MBRL methods. Code is available at https://github.com/ShiguangSun/minco.
中文:MInCo算法通过对比学习解决视觉模型强化学习中的信息冲突,学习不变表征和鲁棒策略,在噪声环境下优于现有方法。
English: The MInCo algorithm addresses information conflicts in visual model-based reinforcement learning by using contrastive learning to develop invariant representations and robust policies, outperforming existing methods in noisy environments.

Authors:Yikai Wang, Guangce Liu, Xinzhou Wang, Zilong Chen, Jiafang Li, Xin Liang, Fuchun Sun, Jun Zhu
Title: Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization
Abstract:
The advancement of 4D (i.e., sequential 3D) generation opens up new possibilities for lifelike experiences in various applications, where users can explore dynamic objects or characters from any viewpoint. Meanwhile, video generative models are receiving particular attention given their ability to produce realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, indicating the potential to act as world simulators. In this work, we present Video4DGen, a novel framework that excels in generating 4D representations from single or multiple generated videos as well as generating 4D-guided videos. This framework is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. The 4D outputs generated by Video4DGen are represented using our proposed Dynamic Gaussian Surfels (DGS), which optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. We design warped-state geometric regularization and refinements on Gaussian surfels, to preserve the structural integrity and fine-grained appearance details. To perform 4D generation from multiple videos and capture representation across spatial, temporal, and pose dimensions, we design multi-video alignment, root pose optimization, and pose-guided frame sampling strategies. The leveraging of continuous warping fields also enables a precise depiction of pose, motion, and deformation over per-video frames. Further, to improve the overall fidelity from the observation of all camera poses, Video4DGen performs novel-view video generation guided by the 4D content, with the proposed confidence-filtered DGS to enhance the quality of generated sequences. With the ability of 4D and video generation, Video4DGen offers a powerful tool for applications in virtual reality, animation, and beyond.
中文: Video4DGen是一种创新框架,能从单个或多个视频生成动态4D表示和视频,通过动态高斯表面元素技术保持时空一致性,为虚拟现实和动画等应用提供高保真内容。
English: Video4DGen is a novel framework that generates dynamic 4D representations and videos from single or multiple inputs, using Dynamic Gaussian Surfels to ensure spatial-temporal coherence and high fidelity for applications like virtual reality and animation.

Authors:Yongchuan Cui, Jinhe Zhang, Peng Liu, Weijing Song, Yi Zeng
Title: Overcoming the Identity Mapping Problem in Self-Supervised Hyperspectral Anomaly Detection
Abstract:
The surge of deep learning has catalyzed considerable progress in self-supervised Hyperspectral Anomaly Detection (HAD). The core premise for self-supervised HAD is that anomalous pixels are inherently more challenging to reconstruct, resulting in larger errors compared to the background. However, owing to the powerful nonlinear fitting capabilities of neural networks, self-supervised models often suffer from the Identity Mapping Problem (IMP). The IMP manifests as a tendency for the model to overfit to the entire image, particularly with increasing network complexity or prolonged training iterations. Consequently, the whole image can be precisely reconstructed, and even the anomalous pixels exhibit imperceptible errors, making them difficult to detect. Despite the proposal of several models aimed at addressing the IMP-related issues, a unified descriptive framework and validation of solutions for IMP remain lacking. In this paper, we conduct an in-depth exploration of IMP, and summarize a unified framework that describes IMP from the perspective of network optimization, which encompasses three aspects: perturbation, reconstruction, and regularization. Correspondingly, we introduce three solutions: superpixel pooling and uppooling for perturbation, error-adaptive convolution for reconstruction, and online background pixel mining for regularization. With extensive experiments being conducted to validate the effectiveness, it is hoped that our work will provide valuable insights and inspire further research for self-supervised HAD. Code: https://github.com/yc-cui/Super-AD.
Chinese: 本文针对自监督高光谱异常检测中的恒等映射问题,提出了一个包含扰动、重建和正则化三个方面的统一框架及相应解决方案,并通过大量实验验证了其有效性。
English: This paper addresses the Identity Mapping Problem in self-supervised Hyperspectral Anomaly Detection by proposing a unified framework covering three aspects (perturbation, reconstruction, and regularization) with a corresponding solution for each, validated through extensive experiments.

Authors:Zekai Shen, Haitao Yuan, Xiaowei Mao, Congkang Lv, Shengnan Guo, Youfang Lin, Huaiyu Wan
Title: Towards An Efficient and Effective En Route Travel Time Estimation Framework
Abstract:
En route travel time estimation (ER-TTE) focuses on predicting the travel time of the remaining route. Existing ER-TTE methods re-estimate at every step, which significantly hinders real-time performance, especially when faced with the computational demands of simultaneous user requests. This results in delays and reduced responsiveness in ER-TTE services. We propose a general efficient framework, U-ERTTE, combining an Uncertainty-Guided Decision mechanism (UGD) and Fine-Tuning with Meta-Learning (FTML) to address these challenges. UGD quantifies the uncertainty and provides confidence intervals for the entire route. It re-estimates only when the actual travel time deviates from the predicted confidence intervals, thereby optimizing the efficiency of ER-TTE. To ensure that both the confidence intervals and the predictions for cases requiring re-estimation are accurate, FTML is employed to train the model, enabling it to learn general driving patterns and specific features to adapt to specific tasks. Extensive experiments on two large-scale real datasets demonstrate that the U-ERTTE framework significantly enhances inference speed and throughput while maintaining high effectiveness. Our code is available at https://github.com/shenzekai/U-ERTTE
中文:U-ERTTE框架通过不确定性引导决策机制和元学习微调技术,仅在行程时间偏离预测区间时进行选择性重估,在保持途中行程时间预测精度的同时,显著提升了计算效率和系统吞吐量。
English: The U-ERTTE framework introduces an uncertainty-guided decision mechanism and meta-learning fine-tuning to selectively re-estimate travel times only when necessary, significantly improving computational efficiency and throughput while maintaining accuracy in en route travel time prediction.
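
The uncertainty-guided decision described in the abstract amounts to a gate: re-estimation is triggered only when the observed travel time falls outside the predicted confidence interval. The sketch below shows that gate alone. The function names, the per-checkpoint `(lower, upper)` interval format, and the sample numbers are hypothetical, and the model that produces the intervals is out of scope here.

```python
def needs_reestimate(observed_time, lower, upper):
    """Return True when the observed travel time leaves the interval."""
    return not (lower <= observed_time <= upper)


def count_reestimates(observed_times, intervals):
    """Count checkpoints at which re-estimation would be triggered.

    observed_times: elapsed times (e.g. in seconds) at each checkpoint.
    intervals: matching (lower, upper) confidence bounds per checkpoint.
    """
    return sum(
        1
        for observed, (lower, upper) in zip(observed_times, intervals)
        if needs_reestimate(observed, lower, upper)
    )


# Illustrative numbers only: the trip drifts outside the interval once.
sample_observed = [60.0, 125.0, 240.0]
sample_intervals = [(50.0, 70.0), (110.0, 130.0), (170.0, 200.0)]
sample_triggers = count_reestimates(sample_observed, sample_intervals)
```

Under this gate, a trip that stays within its predicted intervals never invokes the estimator again, which is where the efficiency gain described in the abstract would come from.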

Authors:Xiao-Hui Li, Fei Yin, Cheng-Lin Liu
Title: DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
Abstract:
Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.
中文: DocSAM是一种基于Transformer的统一框架,通过结合实例分割和语义分割处理多种文档图像分割任务,在提升泛化能力和效率的同时减少了资源消耗。
English: DocSAM is a unified transformer-based framework that integrates instance and semantic segmentation to handle diverse document image segmentation tasks, improving generalization and efficiency while reducing resource usage.
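The classification step described in the abstract (dot product between instance and semantic queries, then softmax) can be sketched in a few lines. This is a minimal illustration, not DocSAM's implementation: the toy 2-dimensional vectors and the names `classify_instances`, `instance_queries`, and `semantic_queries` are assumptions, and the real model obtains its queries from Sentence-BERT and attention layers.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_instances(instance_queries, semantic_queries):
    # For each instance query, score every category by a dot product
    # with that category's semantic query, then normalize with softmax.
    probs = []
    for q in instance_queries:
        logits = [sum(a * b for a, b in zip(q, s)) for s in semantic_queries]
        probs.append(softmax(logits))
    return probs

# Hypothetical toy data: 2 instance queries, 3 category (semantic) queries.
instance_queries = [[2.0, 0.0], [0.0, 3.0]]
semantic_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

probs = classify_instances(instance_queries, semantic_queries)
labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(labels)
```

Because the category set enters only through `semantic_queries`, the same scoring code works for datasets with different label vocabularies, which is what makes joint training on heterogeneous datasets possible.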

Authors:Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru
Title: A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models
Abstract:
Objective: Zero-shot methodology promises to cut down on costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and OpenAI's reasoning models o1 and GPT-OSS to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of GPT-4, o1, and GPT-OSS on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found zero-shot performance to be close to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: LLMs exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation costs and NLP modeling needs but with increased perpetual compute costs. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available for additional benchmarking by the community: https://github.com/bionlproc/ZeroShotRE
中文摘要:大型语言模型在生物医学关系抽取任务中展现出有竞争力的零样本性能,为传统方法提供了一种成本效益高的替代方案,但需要持续的计算资源投入。
English summary: Large language models demonstrate competitive zero-shot performance in biomedical relation extraction tasks, offering a cost-effective alternative to traditional methods while requiring ongoing computational resources.

Authors:Bing Wang, Bingrui Zhao, Ximing Li, Changchun Li, Wanfu Gao, Shengsheng Wang
Title: Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator
Abstract:
Over the past decade, social media platforms have been key in spreading rumors, leading to significant negative impacts. To counter this, the community has developed various Rumor Detection (RD) algorithms to automatically identify them using user comments as evidence. However, these RD methods often fail in the early stages of rumor propagation when only limited user comments are available, leading the community to focus on a more challenging topic named Rumor Early Detection (RED). Typically, existing RED methods learn from limited semantics in early comments. However, our preliminary experiment reveals that the RED models always perform best when the number of training and test comments is consistent and extensive. This inspires us to address the RED issue by generating more human-like comments to support this hypothesis. To implement this idea, we tune a comment generator by simulating expert collaboration and controversy and propose a new RED framework named CAMERED. Specifically, we integrate a mixture-of-expert structure into a generative language model and present a novel routing network for expert collaboration. Additionally, we synthesize a knowledgeable dataset and design an adversarial learning strategy to align the style of generated comments with real-world comments. We further integrate generated and original comments with a mutual controversy fusion module. Experimental results show that CAMERED outperforms state-of-the-art RED baseline models and generation methods, demonstrating its effectiveness.
中文摘要:本研究提出CAMERED框架,通过模拟专家协作生成拟真用户评论来增强谣言早期检测能力,实验证明其性能优于现有最优模型。
English Summary: The study introduces CAMERED, a novel framework for Rumor Early Detection that enhances detection accuracy by generating realistic user comments through expert collaboration simulation and adversarial learning, outperforming existing methods.

Authors:Giovanni Barbarino, Nicolas Gillis, David Sossa
Title: Computing cone-constrained singular values of matrices
Abstract:
The concept of singular values of a rectangular matrix $A$ relative to a pair of closed convex cones $(P,Q)$ has been recently introduced by Seeger and Sossa (Cone-constrained singular value problems, Journal of Convex Analysis 30, pp. 1285-1306, 2023). These singular values are the critical (stationary) values of the non-convex optimization problem of minimizing $\langle u,Av\rangle$ such that $u$ and $v$ are unit vectors in $P$ and $Q$, respectively. When $A$ is the identity matrix, the singular values coincide with the cosine of the critical angles between $P$ and $Q$. When $P$ and $Q$ are positive orthants, the singular values are called Pareto singular values of $A$ and have applications, for instance, in spectral graph theory. This paper deals with the numerical computation of these cone-constrained singular values. We prove the NP-hardness of all the above problems, while identifying cases when such problems can be solved in polynomial time. We then propose four algorithms. Two are exact algorithms, meaning that they are guaranteed to compute a globally optimal solution; one uses an exact non-convex quadratic programming solver, and the other a brute-force active-set method. The other two are heuristics, meaning that they rapidly compute locally optimal solutions; one uses an alternating projection algorithm with extrapolation, and the other a sequential partial linearization approach based on fractional programming. We illustrate the use of these algorithms on several examples.
中文: 本文针对锥约束奇异值的NP难数值计算问题,提出了两种精确算法和两种启发式方法进行求解。
English: This paper addresses the NP-hard problem of numerically computing cone-constrained singular values, proposing two exact algorithms and two heuristic methods for their computation.
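For the Pareto case (both cones are nonnegative orthants), the optimization problem in the abstract can be explored with a plain alternating minimization: fix $v$, pick the best unit $u \ge 0$, then swap roles. The sketch below is only a heuristic in the spirit of the abstract's alternating approach; it omits the extrapolation the authors mention, and the function names and the 2×2 example matrix are assumptions made for illustration.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def best_unit_nonneg(g):
    """Minimize <w, g> over unit vectors w with w >= 0."""
    neg = [max(-gi, 0.0) for gi in g]
    norm = math.sqrt(sum(x * x for x in neg))
    if norm > 0.0:
        # Some entry of g is negative: align w with the negative part of g.
        return [x / norm for x in neg]
    # g >= 0 everywhere: the minimum is attained at the smallest coordinate.
    i = min(range(len(g)), key=g.__getitem__)
    return [1.0 if j == i else 0.0 for j in range(len(g))]

def alternating_min(A, iters=100):
    # Heuristic only: converges to a stationary point, not necessarily
    # the global minimum of <u, A v>.
    n = len(A[0])
    At = transpose(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = best_unit_nonneg(matvec(A, v))
        v = best_unit_nonneg(matvec(At, u))
    value = sum(ui * gi for ui, gi in zip(u, matvec(A, v)))
    return value, u, v

# Hypothetical example matrix.
A = [[-2.0, 0.0], [0.0, 1.0]]
value, u, v = alternating_min(A)
print(value, u, v)
```

Each half-step has a closed-form solution, which is why this kind of heuristic is fast; the price is that different starting points can end at different critical values.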

Authors:Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Title: VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Abstract:
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet
中文摘要:VocalNet通过可扩展且模型无关的训练框架,首次将多令牌预测应用于语音大语言模型,实现了高性能、低延迟的实时语音交互,在有限训练数据下性能媲美主流模型并显著超越现有开源方案。
English Summary: VocalNet introduces a scalable, model-agnostic training framework using multi-token prediction to create high-performance, low-latency speech LLMs that match mainstream models with limited data while surpassing existing open-source alternatives.

Authors:Conghao Xiong, Hao Chen, Joseph J. Y. Sung
Title: A Survey of Pathology Foundation Model: Progress and Future Directions
Abstract:
Computational pathology, which involves analyzing whole slide images for automated cancer diagnosis, relies on multiple instance learning, where performance depends heavily on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced both the extractor and aggregator, but they lack a systematic analysis framework. In this survey, we present a hierarchical taxonomy organizing PFMs through a top-down philosophy applicable to foundation model analysis in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.
中文摘要:本综述提出了一种层次化分类法,通过模型范围、预训练和设计三个维度系统分析病理学基础模型,并划分评估任务类别,指出了模型开发与应用中的关键挑战,为计算病理学发展指明方向。
English Summary: This survey introduces a hierarchical taxonomy for analyzing Pathology Foundation Models (PFMs) by examining model scope, pretraining, and design, while categorizing evaluation tasks and identifying key challenges in development and utilization to advance computational pathology.

Authors:Shintaro Shiba, Yoshimitsu Aoki, Guillermo Gallego
Title: Simultaneous Motion And Noise Estimation with Event Cameras
Abstract:
Event cameras are emerging vision sensors whose noise is challenging to characterize. Existing denoising methods for event cameras are often designed in isolation and thus consider other tasks, such as motion estimation, separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the one-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while demonstrating effectiveness across motion estimation and intensity reconstruction tasks. Our approach advances event-data denoising theory and expands practical denoising use-cases via open-source code. Project page: https://github.com/tub-rip/ESMD
中文: 本文提出了首个同时估计事件相机中运动与噪声的方法,在实现先进去噪性能的同时,提高了运动估计任务的灵活性。
English: This paper introduces the first method to jointly estimate motion and noise in event cameras, achieving state-of-the-art denoising performance while enhancing flexibility across motion estimation tasks.

Authors:Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong
Title: Window Token Concatenation for Efficient Visual Large Language Models
Abstract:
To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging that those within the same window exhibit similar features. To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. Such a design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors. The code is available: https://github.com/JackYFL/WiCo.
中文摘要:作者提出WiCo方法,通过滑动窗口拼接相邻视觉标记以减少VLLMs中的视觉标记数量,并进一步推出WiCo+在语言模型深层分解视觉标记以提升细粒度视觉理解能力,实验表明该方法在粗细粒度视觉任务上均优于现有标记缩减方案。
English Summary: The authors propose WiCo, a method that uses a sliding window to concatenate adjacent visual tokens in VLLMs, and enhance it with WiCo+ for fine-grained tasks by decomposing tokens in later layers, achieving superior performance in both coarse and fine-grained visual understanding tasks.
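The core token-reduction step, concatenating spatially adjacent visual tokens inside a window, can be illustrated on a small grid. This sketch assumes non-overlapping windows (stride equal to the window size) and leaves out the projector and the encoder fine-tuning described in the abstract; `window_concat` and the toy 4×4 grid are hypothetical names and data, not the authors' code.

```python
def window_concat(tokens, height, width, window):
    """Merge each window x window patch of tokens into one longer token.

    tokens: row-major list of height * width feature vectors (lists of floats).
    Returns (height // window) * (width // window) tokens, each of length
    window * window * len(tokens[0]).
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    merged = []
    for r0 in range(0, height, window):
        for c0 in range(0, width, window):
            feat = []
            for dr in range(window):
                for dc in range(window):
                    feat.extend(tokens[(r0 + dr) * width + (c0 + dc)])
            merged.append(feat)
    return merged

# Hypothetical 4x4 grid of 1-dimensional tokens, numbered 0..15.
grid = [[float(i)] for i in range(16)]
merged = window_concat(grid, height=4, width=4, window=2)
print(len(merged), merged[0])
```

With a 2×2 window the token count drops by a factor of four while the per-token dimension grows by the same factor, so no information is discarded at this step; a projector would then map the longer tokens to the LLM's embedding size.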

Authors:Houzhang Fang, Xiaolin Wang, Zengyang Li, Lu Wang, Qingshan Li, Yi Chang, Luxin Yan
Title: Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection
Abstract:
Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as a parameter estimation problem with a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark, called IRBFD, composed of 50,000 infrared images with various nonuniformity types, multi-scale UAV targets, and rich backgrounds with target annotations. Extensive experiments on IRBFD demonstrate that our UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing capabilities. The dataset is available at https://github.com/IVPLaboratory/UniCD.
中文摘要:本文提出UniCD端到端联合框架,通过参数估计和辅助损失同时实现红外非均匀性校正与无人机目标检测,在提升检测鲁棒性的同时保持实时处理能力。
English Summary: The paper introduces UniCD, an end-to-end framework that jointly performs infrared nonuniformity correction and UAV target detection through parameter estimation and auxiliary losses to enhance robustness and real-time processing.

Authors:Wenliang Zheng, Sarkar Snigdha Sarathi Das, Yusen Zhang, Rui Zhang
Title: GREATERPROMPT: A Unified, Customizable, and High-Performing Open-Source Toolkit for Prompt Optimization
Abstract:
LLMs have gained immense popularity among researchers and the general public for their impressive capabilities on a variety of tasks. Notably, the efficacy of LLMs remains significantly dependent on the quality and structure of the input prompts, making prompt design a critical factor for their performance. Recent advancements in automated prompt optimization have introduced diverse techniques that automatically enhance prompts to better align model outputs with user expectations. However, these methods often suffer from a lack of standardization and compatibility across different techniques, limited flexibility in customization, inconsistent performance across model scales, and exclusive reliance on expensive proprietary LLM APIs. To fill this gap, we introduce GREATERPROMPT, a novel framework that democratizes prompt optimization by bringing diverse methods under a unified, customizable API while delivering highly effective prompts for different tasks. Our framework flexibly accommodates various model scales by leveraging both text feedback-based optimization for larger LLMs and internal gradient-based optimization for smaller models to achieve powerful and precise prompt improvements. Moreover, we provide a user-friendly Web UI that ensures accessibility for non-expert users, enabling broader adoption and enhanced performance across various user groups and application scenarios. GREATERPROMPT is available at https://github.com/psunlpgroup/GreaterPrompt via GitHub, PyPI, and web user interfaces.
中文摘要:GREATERPROMPT是一个创新框架,通过统一多种自动提示优化方法到可定制的API中,解决了标准化不足和模型兼容性等问题,同时结合基于反馈和梯度的优化方法,为不同任务生成高效提示。
English Summary: GREATERPROMPT is a novel framework that unifies diverse automated prompt optimization techniques under a customizable API, addressing limitations like standardization issues and model compatibility while providing effective prompts for various tasks through both feedback-based and gradient-based methods.

Authors:Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova
Title: VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Abstract:
We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
中文: 我们推出VideoComp基准和学习框架,旨在通过多事件视频中的时序对齐测试及分层成对偏好损失与预训练策略,提升视觉语言模型在视频文本组合性理解方面的细粒度能力。
English: We introduce VideoComp, a benchmark and framework for enhancing video-text compositionality in vision-language models by testing temporal alignment through challenging disruptions in multi-event videos and improving performance with a hierarchical pairwise loss and pretraining strategy.

Authors:Arash Sajjadi, Mark Eramian
Title: TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning
Abstract:
TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., 3×128×128 tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using 1×1 convolutions, which fuse adjacent features while maintaining their structure. Furthermore, a deep CNN aggregator with residual connections is used to robustly refine the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
中文: TGraphX提出了一种将卷积神经网络与图神经网络相结合的新范式,通过保留空间语义的多维节点特征和结构化信息传递机制,显著提升了视觉推理任务中的物体检测和关系推理性能。
English: TGraphX introduces a unified framework combining CNNs and GNNs to enhance visual reasoning by preserving spatial semantics through multi-dimensional node features and structured message passing, achieving notable improvements in object detection and relational tasks.

Authors:Tyler Ward, Abdullah-Al-Zubaer Imran
Title: Improving Brain Disorder Diagnosis with Advanced Brain Function Representation and Kolmogorov-Arnold Networks
Abstract:
Quantifying functional connectivity (FC), a vital metric for the diagnosis of various brain disorders, traditionally relies on the use of a pre-defined brain atlas. However, using such atlases can lead to issues regarding selection bias and lack of regard for specificity. Addressing this, we propose a novel transformer-based classification network (ABFR-KAN) with effective brain function representation to aid in diagnosing autism spectrum disorder (ASD). ABFR-KAN leverages Kolmogorov-Arnold Network (KAN) blocks replacing traditional multi-layer perceptron (MLP) components. Thorough experimentation reveals the effectiveness of ABFR-KAN in improving the diagnosis of ASD under various configurations of the model architecture. Our code is available at https://github.com/tbwa233/ABFR-KAN
中文: 本研究提出ABFR-KAN,一种基于Transformer的网络,采用Kolmogorov-Arnold网络模块来改进自闭症谱系障碍的诊断,通过优化功能连接表征并克服图谱选择偏差。
English: The study introduces ABFR-KAN, a transformer-based network using Kolmogorov-Arnold Network blocks to enhance autism spectrum disorder diagnosis by improving functional connectivity representation and overcoming atlas-related biases.

Authors:Rufei Ma, Chao Chen
Title: RF-BayesPhysNet: A Bayesian rPPG Uncertainty Estimation Method for Complex Scenarios
Abstract:
Remote photoplethysmography (rPPG) technology infers heart rate by capturing subtle color changes in facial skin using a camera, demonstrating great potential in non-contact heart rate measurement. However, measurement accuracy significantly decreases in complex scenarios such as lighting changes and head movements compared to ideal laboratory conditions. Existing deep learning models often neglect the quantification of measurement uncertainty, limiting their credibility in dynamic scenes. To address the issue of insufficient rPPG measurement reliability in complex scenarios, this paper introduces Bayesian neural networks to the rPPG field for the first time, proposing the Robust Fusion Bayesian Physiological Network (RF-BayesPhysNet), which can model both aleatoric and epistemic uncertainty. It leverages variational inference to balance accuracy and computational efficiency. Due to the current lack of uncertainty estimation metrics in the rPPG field, this paper also proposes a new set of methods, using Spearman correlation coefficient, prediction interval coverage, and confidence interval width, to measure the effectiveness of uncertainty estimation methods under different noise conditions. Experiments show that the model, with only double the parameters compared to traditional network models, achieves a MAE of 2.56 on the UBFC-RPPG dataset, surpassing most models. It demonstrates good uncertainty estimation capability in no-noise and low-noise conditions, providing prediction confidence and significantly enhancing robustness in real-world applications. We have open-sourced the code at https://github.com/AIDC-rPPG/RF-Net
中文: 本文提出RF-BayesPhysNet贝叶斯神经网络,通过建模测量不确定性提升复杂场景下远程心率监测的可靠性,在仅增加一倍参数的情况下于UBFC-RPPG数据集取得优异精度,并开发了新的不确定性评估指标。
English: This paper introduces RF-BayesPhysNet, a Bayesian neural network that models measurement uncertainty to make remote heart rate monitoring more reliable in complex scenarios, achieving an MAE of 2.56 on the UBFC-RPPG dataset with only twice the parameters of traditional models.
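The three quantities named in the abstract (Spearman correlation, prediction interval coverage, and interval width) can be computed as below. This is a generic sketch under stated assumptions, not the paper's evaluation code: it assumes Gaussian predictive intervals of the form mean ± z·std, correlates predicted standard deviation with absolute error, and uses made-up heart-rate values in beats per minute.

```python
import math

def ranks(values):
    # Average ranks (1-based), so that ties share the same rank.
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

def interval_metrics(y_true, mean, std, z=1.96):
    # Coverage: fraction of targets inside [mean - z*std, mean + z*std].
    lower = [m - z * s for m, s in zip(mean, std)]
    upper = [m + z * s for m, s in zip(mean, std)]
    covered = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    coverage = covered / len(y_true)
    width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(y_true)
    return coverage, width

# Hypothetical predictions (bpm): ground truth, predictive mean, predictive std.
y_true = [70.0, 80.0, 65.0, 90.0]
mean = [71.0, 79.0, 70.0, 75.0]
std = [1.0, 1.0, 3.0, 5.0]

abs_err = [abs(y - m) for y, m in zip(y_true, mean)]
rho = spearman(std, abs_err)
coverage, width = interval_metrics(y_true, mean, std)
print(rho, coverage, width)
```

A useful uncertainty estimate should show a high rank correlation between predicted spread and actual error, coverage close to the nominal level, and intervals that are no wider than necessary.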

Authors:Jiho Kim, Cong Hao
Title: RealProbe: An Automated and Lightweight Performance Profiler for In-FPGA Execution of High-Level Synthesis Designs
Abstract:
High-level synthesis (HLS) accelerates FPGA design by rapidly generating diverse implementations using optimization directives. However, even with cycle-accurate C/RTL co-simulation, the reported clock cycles often differ significantly from actual FPGA performance. This discrepancy hampers accurate bottleneck identification, leading to suboptimal design choices. Existing in-FPGA profiling tools, such as the Integrated Logic Analyzer (ILA), require tedious inspection of HLS-generated RTL and manual signal monitoring, reducing productivity. To address these challenges, we introduce RealProbe, the first fully automated, lightweight in-FPGA profiling tool for HLS designs. With a single directive, #pragma HLS RealProbe, the tool automatically generates all necessary code to profile cycle counts across the full function hierarchy, including submodules and loops. RealProbe extracts, records, and visualizes cycle counts with high precision, providing actionable insights into on-board performance. RealProbe is non-intrusive, implemented as independent logic to ensure minimal impact on kernel functionality or timing. It also supports automated design space exploration (DSE), optimizing resource allocation based on FPGA constraints and module complexity. By leveraging incremental synthesis and implementation, DSE runs independently of the original HLS kernel. Evaluated across 28 diverse test cases, including a large-scale design, RealProbe achieves 100% accuracy in capturing cycle counts with minimal logic overhead: just 16.98% LUTs, 43.15% FFs, and 0% BRAM usage. The tool, with full documentation and examples, is available on GitHub at https://github.com/sharc-lab/RealProbe.
中文: RealProbe 是一种用于 HLS 设计的全自动、非侵入式 FPGA 内部分析工具,能够以最小资源开销精确捕获并可视化完整函数层次结构的周期计数,从而实现精准性能分析和自动化设计空间探索。
English: RealProbe is an automated, non-intrusive in-FPGA profiling tool for HLS designs that accurately captures and visualizes cycle counts across the full function hierarchy with minimal resource overhead, enabling precise performance analysis and automated design space exploration.

Authors:Ved Umrajkar, Aakash Kumar Singh
Title: Detection Limits and Statistical Separability of Tree Ring Watermarks in Rectified Flow-based Text-to-Image Generation Models
Abstract:
Tree-Ring Watermarking is a significant technique for authenticating AI-generated images. However, its effectiveness in rectified flow-based models remains unexplored, particularly given the inherent challenges of these models with noise latent inversion. Through extensive experimentation, we evaluated and compared the detection and separability of watermarks between SD 2.1 and FLUX.1-dev models. By analyzing various text guidance configurations and augmentation attacks, we demonstrate how inversion limitations affect both watermark recovery and the statistical separation between watermarked and unwatermarked images. Our findings provide valuable insights into the limitations of Tree-Ring Watermarking in current SOTA models and highlight the critical need for improved inversion methods to achieve reliable watermark detection and separability. The official implementation, dataset release and all experimental results are available at https://github.com/dsgiitr/flux-watermarking.
中文: 树环水印技术在如FLUX.1-dev等整流流模型中因噪声潜在反转的挑战而存在局限性,影响了水印恢复以及水印与非水印图像间的统计可分离性。
English: Tree-Ring Watermarking faces limitations in rectified flow-based models like FLUX.1-dev due to noise latent inversion challenges, affecting both watermark recovery and statistical separability between watermarked and unwatermarked images.

Authors:Ruhui Zhang, Hezhe Qiao, Pengcheng Xu, Mingsheng Shang, Lin Chen
Title: Semantic-guided Representation Learning for Multi-Label Recognition
Abstract:
Multi-label Recognition (MLR) involves assigning multiple labels to each data instance in an image, offering advantages over single-label classification in complex scenarios. However, it faces the challenge of annotating all relevant categories, often leading to uncertain annotations, such as unseen or incomplete labels. Recent Vision and Language Pre-training (VLP) based methods have made significant progress in tackling zero-shot MLR tasks by leveraging rich vision-language correlations. However, the correlation between multi-label semantics has not been fully explored, and the learned visual features often lack essential semantic information. To overcome these limitations, we introduce a Semantic-guided Representation Learning approach (SigRL) that enables the model to learn effective visual and textual representations, thereby improving the downstream alignment of visual images and categories. Specifically, we first introduce a graph-based multi-label correlation module (GMC) to facilitate information exchange between labels, enriching the semantic representation across the multi-label texts. Next, we propose a Semantic Visual Feature Reconstruction module (SVFR) to enhance the semantic information in the visual representation by integrating the learned textual representation during reconstruction. Finally, we optimize the image-text matching capability of the VLP model using both local and global features to achieve zero-shot MLR. Comprehensive experiments are conducted on several MLR benchmarks, encompassing both zero-shot MLR (with unseen labels) and single positive multi-label learning (with limited labels), demonstrating the superior performance of our approach compared to state-of-the-art methods. The code is available at https://github.com/MVL-Lab/SigRL.
中文: 提出的语义引导表示学习方法通过建模标签间关联并利用语义重构视觉特征,在零样本和有限标签的多标签识别任务中表现出卓越性能。
English: The proposed Semantic-guided Representation Learning (SigRL) method enhances multi-label recognition by modeling label correlations and reconstructing visual features with semantic guidance, achieving superior performance in zero-shot and limited-label scenarios.

Authors:Jose Alberto Baeza Guerra
Title: Geospatial and Symbolic Hypothesis for the Foundation of Tenochtitlan Based on Digital Elevation Analysis of the Valley of Mexico
Abstract:
This paper proposes a novel hypothesis about the foundation of Tenochtitlan by combining digital elevation modeling with historical and symbolic analysis. Using geospatial data from EarthExplorer, we simulate various historical water levels in the Valley of Mexico. The resulting lake configurations reveal possible locations for ancient settlements near now-vanished shorelines, suggesting a dynamic transformation of sacred geography that aligns with key Mexica myths. We identify Santa María Aztahuacan as a strong candidate for the historical Aztlan and propose a reinterpretation of foundational codices in light of geomythical correlations.
中文:本研究结合地理空间建模与历史分析,提出墨西哥谷地变化的湖岸线影响了墨西加人的定居模式,认定圣玛丽亚·阿兹塔瓦坎可能是阿兹特兰遗址,并通过地质神话关联重新解读了古籍抄本。
English: This study combines geospatial modeling with historical analysis to propose that shifting shorelines in the Valley of Mexico influenced Mexica settlement patterns, identifying Santa María Aztahuacan as a potential site for Aztlan and reinterpreting codices through geomythical correlations.

Authors:Qian Chen, Xingjian Dong, Zhike Peng, Guang Meng
Title: SHapley Estimated Explanation (SHEP): A Fast Post-Hoc Attribution Method for Interpreting Intelligent Fault Diagnosis
Abstract:
Despite significant progress in intelligent fault diagnosis (IFD), the lack of interpretability remains a critical barrier to practical industrial applications, driving the growth of interpretability research in IFD. Post-hoc interpretability has gained popularity due to its ability to preserve network flexibility and scalability without modifying model structures. However, these methods often yield suboptimal time-domain explanations. Recently, combining domain transform with SHAP has improved interpretability by extending explanations to more informative domains. Nonetheless, the computational expense of SHAP, exacerbated by increased dimensions from domain transforms, remains a major challenge. To address this, we propose patch-wise attribution and SHapley Estimated Explanation (SHEP). Patch-wise attribution reduces feature dimensions at the cost of explanation granularity, while SHEP simplifies subset enumeration to approximate SHAP, reducing complexity from exponential to linear. Together, these methods significantly enhance SHAP's computational efficiency, providing feasibility for real-time interpretation in monitoring tasks. Extensive experiments confirm SHEP's efficiency, interpretability, and reliability in approximating SHAP. Additionally, with open-source code, SHEP has the potential to serve as a benchmark for post-hoc interpretability in IFD. The code is available on https://github.com/ChenQian0618/SHEP.
中文摘要:提出的分块归因和沙普利估计解释(SHEP)方法在保持智能故障诊断可解释性的同时显著提升了计算效率,并通过开源代码为实际应用提供了可行性。
English summary: The proposed patch-wise attribution and SHapley Estimated Explanation (SHEP) method significantly enhances computational efficiency while maintaining interpretability for intelligent fault diagnosis, with open-source code available for practical implementation.
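
To make the exponential-versus-linear contrast in this abstract concrete, the sketch below compares exact Shapley values, which enumerate every coalition, with a two-coalition estimate that needs only a linear number of model calls. It is not the authors' SHEP implementation: the estimator (averaging the gain from adding a patch to the empty set and the loss from removing it from the full set) and the toy additive value function are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n):
    """Exact Shapley values by enumerating every coalition (O(2^n) value calls)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def two_coalition_estimate(value, n):
    """Linear-cost estimate (assumed simplification): average the gain from
    adding patch i to the empty set and from removing it from the full set."""
    full = frozenset(range(n))
    empty = frozenset()
    v_full, v_empty = value(full), value(empty)
    estimate = []
    for i in range(n):
        add_gain = value(frozenset({i})) - v_empty
        remove_gain = v_full - value(full - {i})
        estimate.append(0.5 * (add_gain + remove_gain))
    return estimate


# Hypothetical additive "model": each patch contributes a fixed amount.
contrib = [0.5, -1.0, 2.0, 0.25]


def additive_value(coalition):
    return sum(contrib[i] for i in coalition)


exact = exact_shapley(additive_value, len(contrib))
approx = two_coalition_estimate(additive_value, len(contrib))
```

For an additive value function the two results coincide; for models with strong patch interactions the cheap estimate is only an approximation, which is the trade-off the abstract describes.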

Authors:Brandon Radosevich, John Halloran
Title: MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
Abstract:
To reduce development overhead and enable seamless integration between potential components comprising any given generative AI application, the Model Context Protocol (MCP) (Anthropic, 2024) has recently been released and subsequently widely adopted. The MCP is an open protocol that standardizes API calls to large language models (LLMs), data sources, and agentic tools. By connecting multiple MCP servers, each defined with a set of tools, resources, and prompts, users are able to define automated workflows fully driven by LLMs. However, we show that the current MCP design carries a wide range of security risks for end users. In particular, we demonstrate that industry-leading LLMs may be coerced into using MCP tools to compromise an AI developer's system through various attacks, such as malicious code execution, remote access control, and credential theft. To proactively mitigate these and related attacks, we introduce a safety auditing tool, MCPSafetyScanner, the first agentic tool to assess the security of an arbitrary MCP server. MCPSafetyScanner uses several agents to (a) automatically determine adversarial samples given an MCP server's tools and resources; (b) search for related vulnerabilities and remediations based on those samples; and (c) generate a security report detailing all findings. Our work highlights serious security issues with general-purpose agentic workflows while also providing a proactive tool to audit MCP server safety and address detected vulnerabilities before deployment. The described MCP server auditing tool, MCPSafetyScanner, is freely available at: https://github.com/johnhalloran321/mcpSafetyScanner
中文: 模型上下文协议(MCP)虽能标准化生成式AI组件集成,却存在恶意代码执行等安全隐患,为此开发的MCPSafetyScanner可在部署前主动检测服务器漏洞。
English: The Model Context Protocol (MCP) standardizes generative AI components but introduces security risks like malicious code execution, prompting the development of MCPSafetyScanner to audit server vulnerabilities before deployment.

Authors:Muyun Jiang, Yi Ding, Wei Zhang, Kok Ann Colin Teo, LaiGuan Fong, Shuailei Zhang, Zhiwei Guo, Chenyu Liu, Raghavan Bhuvanakantham, Wei Khang Jeremy Sim, Chuan Huat Vince Foo, Rong Hui Jonathan Chua, Parasuraman Padmanabhan, Victoria Leong, Jia Lu, Balazs Gulyas, Cuntai Guan
Title: Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer
Abstract:
Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of the signal. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word in five utterances within a ten-second duration. Given the spatio-temporal nature of the neural activation process during speech pronunciation, we developed a Functional Areas Spatio-temporal Transformer (FAST), an effective framework for converting EEG signals into tokens and utilizing transformer architecture for sequence encoding. Our results reveal distinct and interpretable speech neural features by the visualization of FAST-generated activation maps across frontal and temporal brain regions with each word being covertly spoken, providing new insights into the discriminative features of the neural representation of covert speech. This is the first report of such a study, which provides interpretable evidence for speech decoding from EEG. The code for this work has been made public at https://github.com/Jiang-Muyun/FAST
中文摘要:本研究开发了功能区域时空转换器(FAST)框架,成功从脑电信号中解码隐性言语,揭示了前额叶和颞叶脑区的独特神经激活模式,为基于脑电的言语解码提供了首个可解释证据。
English Summary: This study introduces a Functional Areas Spatio-temporal Transformer (FAST) framework that successfully decodes covert speech from EEG signals, revealing distinct neural activation patterns in frontal and temporal brain regions and providing the first interpretable evidence for EEG-based speech decoding.

Authors:Shijie Ma, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
Title: ProtoGCD: Unified and Unbiased Prototype Learning for Generalized Category Discovery
Abstract:
Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, enabling unified modeling between old and new classes. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at https://github.com/mashijie1028/ProtoGCD.
中文:ProtoGCD提出了一种统一的原型学习框架,通过联合建模新旧类别并采用自适应伪标记与正则化机制,解决了广义类别发现中的不平衡问题,在多个数据集上实现了最优性能。
English: ProtoGCD introduces a unified prototype learning framework that addresses imbalances in generalized category discovery by jointly modeling old and new classes with adaptive pseudo-labeling and regularization, achieving state-of-the-art performance across datasets.

Authors:Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang
Title: TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images
Abstract:
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies on situations that commonly occur in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
Chinese Summary: 本研究提出TDBench基准测试,基于公开俯视数据集和高质量模拟图像构建涵盖十个评估维度的视觉问答对,并通过四个案例研究评估视觉语言模型对俯视图像的理解能力。
English Summary: The study introduces TDBench, a benchmark built from public top-down datasets and high-quality simulated images that evaluates Vision-Language Models on top-down image understanding through visual question-answer pairs across ten evaluation dimensions and four case studies.

Authors:Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang
Title: TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models
Abstract:
Top-down images play an important role in safety-critical settings such as autonomous navigation and aerial surveillance, where they provide holistic spatial information that front-view images cannot capture. Despite this, Vision Language Models (VLMs) are mostly trained and evaluated on front-view benchmarks, leaving their performance in the top-down setting poorly understood. Existing evaluations also overlook a unique property of top-down images: their physical meaning is preserved under rotation. In addition, conventional accuracy metrics can be misleading, since they are often inflated by hallucinations or "lucky guesses", which obscures a model's true reliability and its grounding in visual evidence. To address these issues, we introduce TDBench, a benchmark for top-down image understanding that includes 2000 curated questions for each rotation. We further propose RotationalEval (RE), which measures whether models provide consistent answers across four rotated views of the same scene, and we develop a reliability framework that separates genuine knowledge from chance. Finally, we conduct four case studies targeting underexplored real-world challenges. By combining rigorous evaluation with reliability metrics, TDBench not only benchmarks VLMs in top-down perception but also provides a new perspective on trustworthiness, guiding the development of more robust and grounded AI systems. Project homepage: https://github.com/Columbia-ICSL/TDBench
Chinese Summary: 本研究提出TDBench基准测试,通过创新的评估框架和案例研究,针对俯视图像评估视觉语言模型,解决其旋转不变性和可靠性等被忽视的问题。
English Summary: The study introduces TDBench, a benchmark for evaluating Vision Language Models on top-down images, addressing their overlooked rotational invariance and reliability issues through a novel evaluation framework and case studies.
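
The difference between conventional accuracy and a rotation-consistent score can be illustrated with a few lines of Python. This is a sketch under an assumption: the abstract only says RotationalEval checks for consistent answers across four rotated views, so crediting a question only when all four views are answered correctly is one plausible reading, and the `results` data below is hypothetical.

```python
def per_rotation_accuracy(results):
    """Mean accuracy over every (question, rotation) pair."""
    total = sum(len(views) for views in results.values())
    correct = sum(sum(views) for views in results.values())
    return correct / total


def rotation_consistent_accuracy(results):
    """Fraction of questions answered correctly in all rotated views
    (assumed scoring rule, not taken from the benchmark code)."""
    return sum(all(views) for views in results.values()) / len(results)


# Hypothetical outcomes: question id -> correctness at 0, 90, 180, 270 degrees.
results = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],
    "q3": [False, False, False, False],
    "q4": [True, True, True, True],
}

plain = per_rotation_accuracy(results)
consistent = rotation_consistent_accuracy(results)
```

Here `q2` raises the per-rotation accuracy but earns no credit under the stricter rule, which is how a consistency-based score can expose answers that are right only at some orientations.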

Authors:Teodor Chiaburu, Felix Bießmann, Frank Haußer
Title: Uncertainty Propagation in XAI: A Comparison of Analytical and Empirical Estimators
Abstract:
Understanding uncertainty in Explainable AI (XAI) is crucial for building trust and ensuring reliable decision-making in Machine Learning models. This paper introduces a unified framework for quantifying and interpreting uncertainty in XAI by defining a general explanation function $e_{\theta}(x, f)$ that captures the propagation of uncertainty from key sources: perturbations in input data and model parameters. By using both analytical and empirical estimates of explanation variance, we provide a systematic means of assessing the impact of uncertainty on explanations. We illustrate the approach using first-order uncertainty propagation as the analytical estimator. In a comprehensive evaluation across heterogeneous datasets, we compare analytical and empirical estimates of uncertainty propagation and evaluate their robustness. Extending previous work on inconsistencies in explanations, our experiments identify XAI methods that do not reliably capture and propagate uncertainty. Our findings underscore the importance of uncertainty-aware explanations in high-stakes applications and offer new insights into the limitations of current XAI methods. The code for the experiments can be found in our repository at https://github.com/TeodorChiaburu/UXAI
中文: 本文提出了一个统一框架,通过分析输入数据和模型参数变化对解释的影响来量化可解释人工智能中的不确定性,揭示了现有方法的局限性,并强调了在关键应用中采用不确定性感知方法的重要性。
English: This paper presents a unified framework for quantifying uncertainty in Explainable AI by analyzing how input data and model parameter variations affect explanations, revealing limitations in current methods and emphasizing the need for uncertainty-aware approaches in critical applications.
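
As a rough illustration of comparing an analytical and an empirical estimator, the sketch below propagates independent Gaussian input noise through a scalar function in two ways: a first-order (delta-method) variance built from finite-difference gradients, and a Monte Carlo variance from perturbed samples. The function `linear_explanation`, the noise levels, and the independence assumption are all hypothetical stand-ins, not the paper's explanation function or experimental setup.

```python
import random


def gradient(g, x, h=1e-5):
    """Central finite-difference gradient of a scalar function g at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((g(up) - g(down)) / (2 * h))
    return grad


def analytical_variance(g, x, sigmas):
    """First-order variance, assuming independent input noise with std sigmas."""
    return sum((d * s) ** 2 for d, s in zip(gradient(g, x), sigmas))


def empirical_variance(g, x, sigmas, n_samples=20000, seed=0):
    """Monte Carlo variance of g under Gaussian perturbations of x."""
    rng = random.Random(seed)
    samples = [
        g([xi + rng.gauss(0.0, s) for xi, s in zip(x, sigmas)])
        for _ in range(n_samples)
    ]
    mean = sum(samples) / n_samples
    return sum((v - mean) ** 2 for v in samples) / (n_samples - 1)


def linear_explanation(x):
    """Hypothetical attribution score that is linear in the inputs."""
    return 3.0 * x[0] - 2.0 * x[1]


x0 = [1.0, 2.0]
sigmas = [0.1, 0.2]
var_analytical = analytical_variance(linear_explanation, x0, sigmas)
var_empirical = empirical_variance(linear_explanation, x0, sigmas)
```

For a linear function the first-order estimate is exact, so the two numbers should agree up to sampling error; a large gap for a nonlinear explanation method would indicate that the linearization is missing part of the propagated uncertainty.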

Authors:Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu
Title: CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Abstract:
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
中文: 我们提出了FGRPR框架,将GRPO与模糊奖励函数结合以提高学习效率,通过提供细致激励促进精确输出,在多个数据集上超越了包括GPT4o和SFT在内的基线模型。
English: We propose FGRPR, a framework combining GRPO with a fuzzy reward function to improve learning efficiency, outperforming baseline models including GPT4o and SFT across datasets by providing nuanced incentives for precise outputs.
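
The contrast between a binary 0/1 accuracy reward and a graded reward for counting can be sketched as follows. The exact fuzzy reward used by FGRPR is not given in this abstract, so the formula here (one minus the relative error, clipped at zero) is purely an illustrative assumption.

```python
def binary_reward(pred, target):
    """Conventional 0/1 accuracy reward: credit only for an exact count."""
    return 1.0 if pred == target else 0.0


def fuzzy_reward(pred, target):
    """Illustrative graded reward in [0, 1]: 1 - relative error, clipped at 0.
    This formula is an assumption, not the reward defined in the paper."""
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - target) / target)


# A near miss on a large crowd earns nothing under the binary reward
# but most of the credit under the graded one.
near_miss_binary = binary_reward(98, 100)
near_miss_fuzzy = fuzzy_reward(98, 100)
```

With a graded signal of this kind, two wrong predictions are no longer equally bad, which gives a policy-gradient method something to rank within a group of sampled answers.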

Authors:Lihui Liu, Zihao Wang, Dawei Zhou, Ruijie Wang, Yuchen Yan, Bo Xiong, Sihong He, Kai Shu, Hanghang Tong
Title: TransNet: Transfer Knowledge for Few-shot Knowledge Graph Completion
Abstract:
Knowledge graphs (KGs) are ubiquitous and widely used in various applications. However, most real-world knowledge graphs are incomplete, which significantly degrades their performance on downstream tasks. Additionally, the relationships in real-world knowledge graphs often follow a long-tail distribution, meaning that most relations are represented by only a few training triplets. To address these challenges, few-shot learning has been introduced. Few-shot KG completion aims to make accurate predictions for triplets involving novel relations when only a limited number of training triplets are available. Although many methods have been proposed, they typically learn each relation individually, overlooking the correlations between different tasks and the relevant information in previously trained tasks. In this paper, we propose a transfer learning-based few-shot KG completion method (TransNet). By learning the relationships between different tasks, TransNet effectively transfers knowledge from similar tasks to improve the current task's performance. Furthermore, by employing meta-learning, TransNet can generalize effectively to new, unseen relations. Extensive experiments on benchmark datasets demonstrate the superiority of TransNet over state-of-the-art methods. Code can be found at https://github.com/lihuiliullh/TransNet/tree/main
中文: 本文提出TransNet方法,通过迁移学习和元学习技术挖掘任务间关联性,有效提升少样本知识图谱补全任务中对新关系的预测性能,实验证明其优于现有先进方法。
English: This paper introduces TransNet, a transfer learning-based method for few-shot knowledge graph completion that leverages task relationships and meta-learning to enhance performance on novel relations, demonstrating superior results over existing approaches.

Authors:Yongyi Yang, Jianyang Gao, Wei Hu
Title: RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm
Abstract:
Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA.
中文:RaanA提出了一种统一的PTQ框架,结合RaBitQ-H实现高效量化和AllocateBits优化比特分配,以极少数据和灵活比特实现优异性能。
English: RaanA introduces a unified PTQ framework with RaBitQ-H for efficient quantization and AllocateBits for optimal bit-width allocation, achieving high performance with minimal data and flexible bit usage.

Authors:Yuzhu Lei, Guanding Yu
Title: A multi-scale lithium-ion battery capacity prediction using mixture of experts and patch-based MLP
Abstract:
Lithium-ion battery health management has become increasingly important as the application of batteries expands. Precise forecasting of capacity degradation is critical for ensuring the healthy usage of batteries. In this paper, we propose MSPMLP, a multi-scale capacity prediction model utilizing the mixture of experts (MoE) architecture and patch-based multi-layer perceptron (MLP) blocks, to capture both the long-term degradation trend and local capacity regeneration phenomena. Specifically, we utilize patch-based MLP blocks with varying patch sizes to extract multi-scale features from the capacity sequence. Leveraging the MoE architecture, the model adaptively integrates the extracted features, thereby enhancing its capacity and expressiveness. Finally, the future battery capacity is predicted based on the integrated features, achieving high prediction accuracy and generalization. Experimental results on the public NASA dataset indicate that MSPMLP achieves a mean absolute error (MAE) of 0.0078, improving by 41.8% compared to existing methods. These findings highlight that MSPMLP, owing to its multi-scale modeling capability and generalizability, provides a promising solution to the battery capacity prediction challenges caused by capacity regeneration phenomena and complex usage conditions. The code of this work is provided at https://github.com/LeiYuzhu/CapacityPredict.
中文摘要:本文创新提出MSPMLP多尺度容量预测模型,通过混合专家架构和基于补丁的多层感知器模块,有效捕捉电池长期退化趋势和局部容量再生现象,在NASA数据集上实现预测精度41.8%的提升。
English Summary: This paper introduces MSPMLP, an innovative multi-scale battery capacity prediction model that combines mixture of experts architecture with patch-based MLP blocks to accurately forecast both long-term degradation trends and local capacity regeneration, achieving a 41.8% improvement in prediction accuracy on the NASA dataset.

Authors:Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan
Title: MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization
Abstract:
Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors: i) on-the-fly quantization and de-quantization, which cause significant performance overhead; and ii) the prevalence of outliers in KV values, which challenges low-bitwidth uniform quantization. To this end, we propose MILLION, a novel quantization framework achieving low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework with efficient attention kernel and pipeline design for MILLION that leverages sparse computation and asynchronous quantization, significantly enhancing inference speed. Comprehensive evaluation results demonstrate that MILLION can achieve 4-bit quantization with trivial perplexity and accuracy loss, and achieve 2.09x end-to-end performance gains at 32K context length. Code is released at https://github.com/ZongwuWang/MILLION.
Chinese: 随着上下文长度的增加,大语言模型在推理速度和内存管理方面面临挑战,而MILLION框架通过创新的乘积量化方法,实现了高效的4位KV缓存压缩,在保证精度的同时显著提升了性能。
English: Large language models face challenges with inference speed and memory management as context lengths increase, but the MILLION framework introduces a novel product quantization approach that achieves efficient 4-bit KV cache compression with minimal accuracy loss and significant performance gains.
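
For readers unfamiliar with product quantization, the sketch below shows the basic idea in plain Python: each vector is split into sub-vectors, a small codebook is learned per sub-space with k-means, and the vector is stored as one codebook index per sub-space. This is generic product quantization on random data, not MILLION's pipeline; outlier handling, GPU kernels, and asynchronous quantization are not modeled, and the sizes (8 dimensions, 4 sub-spaces, 16 centroids) are arbitrary assumptions.

```python
import random


def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(sub, codebook):
    """Index of the centroid closest to sub."""
    return min(range(len(codebook)), key=lambda c: sq_dist(sub, codebook[c]))


def kmeans(points, k, iters=10, seed=0):
    """Small k-means used to learn one codebook per sub-space."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid when a cluster is empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_codebooks(vectors, m, k):
    return [kmeans([split(v, m)[j] for v in vectors], k, seed=j) for j in range(m)]


def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [nearest(sub, cb) for sub, cb in zip(subs, codebooks)]


def decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out


rng = random.Random(42)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(vectors, m=4, k=16)  # 16 centroids -> 4-bit codes
codes = encode(vectors[0], codebooks)
restored = decode(codes, codebooks)
```

Because the centroids are learned from the data rather than placed on a uniform grid, the quantization levels are non-uniform, which is the property the abstract relies on to cope with outliers.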

Authors:Diyaz Yakubov, David Hästbacka
Title: Comparative Analysis of Lightweight Kubernetes Distributions for Edge Computing: Performance and Resource Efficiency
Abstract:
Edge computing environments increasingly rely on lightweight container orchestration platforms to manage resource-constrained devices. This paper provides an empirical analysis of five lightweight Kubernetes distributions (KDs): k0s, k3s, KubeEdge, OpenYurt, and Kubernetes (k8s), focusing on their performance and resource efficiency in edge computing scenarios. We evaluated key metrics such as CPU, memory, disk usage, throughput, and latency under varying workloads, utilizing a testbed of Intel NUCs and Raspberry Pi devices. Our results demonstrate significant differences in performance: k3s exhibited the lowest resource consumption, while k0s and k8s excelled in data plane throughput and latency. Under heavy stress scenarios, k3s and k0s accomplished the same workloads faster than the other distributions. OpenYurt offered balanced performance, suitable for hybrid cloud-edge use cases, but was less efficient in terms of resource usage and scalability compared to k0s, k3s and k8s. KubeEdge, although feature-rich for edge environments, exhibited higher resource consumption and lower scalability. These findings offer valuable insights for developers and operators selecting an appropriate KD based on specific performance and resource efficiency requirements for edge computing environments.
中文摘要:本文对五种轻量级Kubernetes发行版的实证分析表明,k3s资源消耗最低,k0s和k8s在吞吐量与延迟方面表现优异,研究结果为边缘计算场景中基于性能与资源效率的平台选择提供了关键依据。
English Summary: This empirical analysis of five lightweight Kubernetes distributions reveals that k3s has the lowest resource consumption, while k0s and k8s excel in throughput and latency, with findings guiding optimal platform selection for edge computing environments.

Authors:The AIBrix Team, Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, Rong Kang, Tongping Liu, Yifei Zhang, Yiqing Zhu, Shuowei Jin, Gangmuk Lim, Binbin Chen, Zuzhi Chen, Xiao Liu, Xin Chen, Kante Yin, Chak-Pong Chung, Chenyu Jiang, Yicheng Lu, Jianjun Chen, Caixue Lin, Wu Xiang, Rui Shi, Liguang Xie
Title: AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
Abstract:
We introduce AIBrix, a cloud-native, open-source framework designed to optimize and simplify large-scale LLM deployment in cloud environments. Unlike traditional cloud-native stacks, AIBrix follows a co-design philosophy, ensuring every layer of the infrastructure is purpose-built for seamless integration with inference engines like vLLM. AIBrix introduces several key innovations to reduce inference costs and enhance performance, including high-density LoRA management for dynamic adapter scheduling, LLM-specific autoscalers, and prefix-aware, load-aware routing. To further improve efficiency, AIBrix incorporates a distributed KV cache, boosting token reuse across nodes, leading to a 50% increase in throughput and a 70% reduction in inference latency. AIBrix also supports a unified AI runtime, which streamlines model management while maintaining vendor-agnostic engine compatibility. For large-scale multi-node inference, AIBrix employs hybrid orchestration -- leveraging Kubernetes for coarse-grained scheduling and Ray for fine-grained execution -- to balance efficiency and flexibility. Additionally, an SLO-driven GPU optimizer dynamically adjusts resource allocations, optimizing heterogeneous serving to maximize cost efficiency while maintaining service guarantees. Finally, AIBrix enhances system reliability with AI accelerator diagnostic tools, enabling automated failure detection and mock-up testing to improve fault resilience. AIBrix is available at https://github.com/vllm-project/aibrix.
中文: AIBrix是一个开源的云原生框架,通过动态LoRA管理、分布式KV缓存和混合编排等创新技术,显著提升了大模型部署的吞吐量并降低延迟,在保证系统可靠性的同时实现了最优成本效益。
English: AIBrix is an open-source, cloud-native framework that optimizes large-scale LLM deployment through innovations like dynamic LoRA management, distributed KV caching, and hybrid orchestration, achieving significant improvements in throughput and latency while ensuring cost efficiency and system reliability.

Authors:Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, Dongyeop Kang
Title: Align to Structure: Aligning Large Language Models with Structural Information
Abstract:
Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at https://github.com/minnesotanlp/struct_align.
中文:结构对齐方法通过强化学习将大型语言模型与人类话语结构对齐,利用细粒度奖励提升文本连贯性,在文章生成和长文档摘要等任务中优于现有模型。
English: Structural Alignment enhances long-form text generation in LLMs by aligning them with human discourse structures through reinforcement learning, using token-level rewards to improve coherence and outperforming existing models in tasks like essay writing and summarization.

Authors:Niu Lian, Jun Li, Jinpeng Wang, Ruisheng Luo, Yaowei Wang, Shu-Tao Xia, Bin Chen
Title: AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
Abstract:
Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.
中文:AutoSSVH通过对抗性采样选择信息丰富的视频帧,并结合对比学习与投票策略,显著提升了哈希码的区分度和视频检索效果。
English: AutoSSVH enhances video hashing by using adversarial sampling to select information-rich frames and employs contrastive learning with a voting strategy to improve hash code discriminability and retrieval performance.

Authors:Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Abstract:
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
中文摘要:SynWorld框架通过合成多步动作场景和执行蒙特卡洛树搜索探索,使基于大语言模型的智能体能够自主优化工作流程并增强动作理解,实验证明其在新环境中学习动作知识的有效性。
English Summary: SynWorld is a framework that enables LLM-based agents to autonomously explore novel environments by synthesizing scenarios and using Monte Carlo Tree Search to refine their action knowledge, proving effective in experimental evaluations.

Authors:Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: Agentic Knowledgeable Self-awareness
Abstract:
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
中文摘要:提出的KnowSelf范式通过情境感知使基于大语言模型的智能体能够自主调控知识运用,在不同任务中以最少的外部知识实现更优的规划效果。
English Summary: The proposed KnowSelf paradigm enables LLM-based agents to autonomously regulate knowledge usage through situational awareness, achieving superior planning performance with minimal external knowledge across various tasks.

Authors:Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Title: MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Abstract:
Multilingual speech translation (ST) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present, to the best of our knowledge, the first systematic study on medical ST by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, Traditional Chinese and Simplified Chinese, together with the models. With 290,000 samples, our dataset is the largest medical machine translation (MT) dataset and the largest many-to-many multilingual ST dataset among all domains. We also present the most extensive analysis study in ST research to date, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence (seq2seq) comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.
中文摘要:本研究推出了首个医疗领域大规模多语言语音翻译数据集MultiMed-ST,包含29万条五语言样本,并通过全面对比分析推动跨语言医疗交流的发展。
English Summary: This study introduces MultiMed-ST, the first large-scale multilingual speech translation dataset for the medical domain, featuring 290,000 samples across five languages and comprehensive comparative analyses to advance cross-lingual healthcare communication.

Authors:Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni
Title: LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders
Abstract:
In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minutes video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks -- LVU, COIN, and Breakfast -- employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval. Code is available at https://github.com/amazon-science/lv-mae.
中文: LV-MAE是一种自监督学习框架,通过分离短时与长时依赖关系高效处理长视频,利用先进的多模态编码和掩码嵌入自编码器在多个基准测试中取得了最优性能。
English: LV-MAE is a self-supervised framework that efficiently processes long videos by decoupling short- and long-span dependencies, achieving state-of-the-art results on multiple benchmarks through advanced multimodal encoding and masked-embedding autoencoders.

Authors:Xi Wang, Ziqi He, Yang Zhou
Title: Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis
Abstract:
Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of the blocks' importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first prove theoretically that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: https://github.com/Hytidel/UNetReweighting
中文: 本研究提出了一种针对基于U-Net的扩散模型中Transformer模块的自适应重加权方法,能在推理过程中动态调整其重要性,从而提升信噪比,并在图像生成与编辑任务中同时提高效率与美学质量。
English: This study introduces an adaptive re-weighting method for Transformer blocks in U-Net-based diffusion models, which dynamically adjusts their importance during inference to enhance signal-to-noise ratio and improve both efficiency and aesthetic quality in image generation and editing tasks.

Authors:Nasar Iqbal, Niki Martinel
Title: Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection
Abstract:
Recent advances in convolutional neural networks (CNNs) and transformer-based methods have improved anomaly detection and localization, but challenges persist in precisely localizing small anomalies. While CNNs face limitations in capturing long-range dependencies, transformer architectures often suffer from substantial computational overheads. We introduce a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization, a novel approach designed to address the challenge of small anomaly localization. Our method captures fine-grained details at multiple scales by integrating the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator. An improvement of $+1\%$ AP for multi-class anomaly localization and a $+1\%$ increase in AU-PRO on the MVTec benchmark demonstrate our method's superiority in precise anomaly localization across diverse industrial scenarios. The code is available at https://github.com/iqbalmlpuniud/Pyramid Mamba.
中文: 提出的金字塔扫描策略(基于状态空间模型)通过多尺度特征提取和合成异常生成,提升了多类别异常检测与定位的精度,在MVTec基准测试中AP和AU-PRO指标均提高1%,显著改善了小异常定位能力。
English: The proposed Pyramidal Scanning Strategy (SSM-based PSS) enhances multi-class anomaly detection and localization by capturing fine-grained details at multiple scales, achieving a +1% improvement in AP and AU-PRO on the MVTec benchmark for precise small anomaly identification.

Authors:Adam Moss
Title: The AI Cosmologist I: An Agentic System for Automated Data Analysis
Abstract:
We present the AI Cosmologist, an agentic system designed to automate cosmological/astronomical data analysis and machine learning research workflows. This implements a complete pipeline from idea generation to experimental evaluation and research dissemination, mimicking the scientific process typically performed by human researchers. The system employs specialized agents for planning, coding, execution, analysis, and synthesis that work together to develop novel approaches. Unlike traditional auto machine-learning systems, the AI Cosmologist generates diverse implementation strategies, writes complete code, handles execution errors, analyzes results, and synthesizes new approaches based on experimental outcomes. We demonstrate the AI Cosmologist's capabilities across several machine learning tasks, showing how it can successfully explore solution spaces, iterate based on experimental results, and combine successful elements from different approaches. Our results indicate that agentic systems can automate portions of the research process, potentially accelerating scientific discovery. The code and experimental data used in this paper are available on GitHub at https://github.com/adammoss/aicosmologist. Example papers included in the appendix demonstrate the system's capability to autonomously produce complete scientific publications, starting from only the dataset and task description.
中文: AI宇宙学家是一个自主系统,通过专业智能体实现从构思到发表的全流程自动化研究,加速宇宙学领域的科学发现。
English: The AI Cosmologist is an autonomous system that automates the entire scientific research workflow in cosmology, from idea generation to publication, using specialized agents to accelerate discovery.

Authors:Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya
Title: StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Abstract:
Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in this area. Stereotype and anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task: we propose a five-tuple definition and provide precise terminology disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well-curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect's effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code are available at https://github.com/KaustubhShejole/StereoDetect.
English Summary: This study addresses the critical need to distinguish stereotypes from stereotypical biases in AI by proposing a clear five-tuple definition and introducing StereoDetect, a carefully curated benchmark that reveals significant classification failures in current language models.

Authors:Denis Coquenet
Title: Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition
Abstract:
Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling a better context modeling. It relies on two main components: windowed queries, to process several transformer queries altogether, enlarging the context modeling with near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.
中文摘要:Meta-DAN通过引入窗口化查询和多令牌预测技术,显著降低了页面文本识别的预测时间并增强了上下文建模能力,在手写数据集上取得了领先的性能表现。
English Summary: The Meta Document Attention Network (Meta-DAN) introduces windowed queries and multi-token predictions to significantly reduce prediction time and improve context modeling in page-level text recognition, achieving state-of-the-art results on handwritten datasets.

Authors:Makoto Takamoto, Daniel Oñoro-Rubio, Wiem Ben Rim, Takashi Maruyama, Bhushan Kotnis
Title: Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction
Abstract:
Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose \textbf{E}mbedding \textbf{MU}tation (\textsc{EMU}), a novel framework that \emph{generates} negative samples satisfying this condition, in contrast to conventional methods that focus on \emph{identifying} challenging negative samples within the training data. Importantly, the simplicity of \textsc{EMU} ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, \textsc{EMU} enables performance improvements comparable to those achieved by models with an embedding dimension five times larger. An implementation of the method and experiments is available at https://github.com/nec-research/EMU-KG.
中文: 本文提出EMU框架,基于理论条件生成知识图谱嵌入模型的最优负样本,显著提升了多种模型和数据集上的链接预测性能。
English: This paper introduces EMU, a framework that generates optimal negative samples for knowledge graph embedding models based on a theoretical condition, significantly improving link prediction performance across various models and datasets.

Authors:Lin yueyu, Liu Xiao
Title: RWKVTTS: Yet another TTS based on RWKV-7
Abstract:
Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7 \cite{peng2025rwkv}, a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. Our code and weights are available at https://github.com/yynil/RWKVTTS and https://huggingface.co/spaces/RWKV-Red-Team.
中文: RWKV-7提出了一种基于RNN的创新架构,在文本转语音应用中超越了传统Transformer模型,在效率、速度和语音自然度上表现更优,有效提升了多语言及低资源环境下的技术普及性。
English: RWKV-7 introduces a novel RNN-based architecture for text-to-speech that surpasses transformer models in efficiency, speed, and naturalness, enhancing accessibility across diverse linguistic and low-resource settings.

Authors:Guido Barducci, Ivan Rossi, Francesco Codicè, Cesare Rollo, Valeria Repetto, Corrado Pancotti, Virginia Iannibelli, Tiziana Sanavia, Piero Fariselli
Title: JanusDDG: A Thermodynamics-Compliant Model for Sequence-Based Protein Stability via Two-Fronts Multi-Head Attention
Abstract:
Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease-related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture to predict $\Delta\Delta G$ of single and multiple-residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self-attention, JanusDDG computes queries (Q) and values (V) as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two. This cross-interleaved attention mechanism enables the model to capture mutation-induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state-of-the-art performance in predicting $\Delta\Delta G$ from sequence alone, matching or exceeding the accuracy of structure-based methods for both single and multiple mutations. Code availability: https://github.com/compbiomed-unito/JanusDDG
Chinese: JanusDDG是一种深度学习框架,利用蛋白质语言模型和双向交叉注意力变换器,在遵循热力学原理的同时准确预测单点和多点突变的ΔΔG,仅凭序列数据即达到最先进的性能水平。
English: JanusDDG is a deep learning framework that utilizes protein language models and a bidirectional cross-attention transformer to accurately predict ΔΔG for single and multiple mutations while adhering to thermodynamic principles, achieving state-of-the-art performance from sequence data alone.
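
For readers unfamiliar with the two thermodynamic properties named in the abstract, they can be written out as below. This is a standard statement that follows from $\Delta G$ being a state function, not a formula taken from the paper; the labels $A$, $B$, $C$ for protein variants are introduced here for illustration only.

```latex
% Antisymmetry: reversing a mutation flips the sign of the stability change.
\Delta\Delta G(A \to B) = -\,\Delta\Delta G(B \to A)

% Transitivity: stability changes add along a path of mutations.
\Delta\Delta G(A \to C) = \Delta\Delta G(A \to B) + \Delta\Delta G(B \to C)
```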

Authors:Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
Title: SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
Abstract:
Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.
中文:SARLANG-1M是一个大规模多模态基准数据集,通过提供超过100万个SAR图像-文本对,有效弥补了SAR图像解译的不足,使得视觉语言模型在微调后能达到接近人类专家的性能水平。
English: SARLANG-1M is a large-scale multimodal benchmark designed to bridge the gap in SAR image interpretation by providing over 1 million image-text pairs, which significantly enhances the performance of Vision-Language Models to near-human expert levels after fine-tuning.

Authors:Lifan Hu
Title: Learning Lie Group Generators from Trajectories
Abstract:
This work investigates the inverse problem of generator recovery in matrix Lie groups from discretized trajectories. Let $G$ be a real matrix Lie group and $\mathfrak{g} = \text{Lie}(G)$ its corresponding Lie algebra. A smooth trajectory $\gamma(t)$ generated by a fixed Lie algebra element $\xi \in \mathfrak{g}$ follows the exponential flow $\gamma(t) = g_0 \cdot \exp(t\xi)$. The central task addressed in this work is the reconstruction of such a latent generator $\xi$ from a discretized sequence of poses $\{g_0, g_1, \dots, g_T\} \subset G$, sampled at uniform time intervals. This problem is formulated as a data-driven regression from normalized sequences of discrete Lie algebra increments $\log\left(g_{t}^{-1} g_{t+1}\right)$ to the constant generator $\xi \in \mathfrak{g}$. A feedforward neural network is trained to learn this mapping across several groups, including $\text{SE}(2)$, $\text{SE}(3)$, $\text{SO}(3)$, and $\text{SL}(2,\mathbb{R})$. It demonstrates strong empirical accuracy under both clean and noisy conditions, which validates the viability of data-driven recovery of Lie group generators using shallow neural architectures. The Lie-RL GitHub repository is at https://github.com/Anormalm/LieRL-on-Trajectories. Suggestions and collaborations are welcome.
中文: 本研究利用神经网络开发了一种数据驱动方法,能够从矩阵李群的离散化轨迹中准确恢复潜在生成器,并在不同条件下展现出鲁棒性能。
English: This study develops a data-driven approach using a neural network to accurately recover the latent generator from discretized trajectories in matrix Lie groups, demonstrating robust performance under various conditions.
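
To make the regression target concrete, the sketch below builds a noise-free $\text{SO}(3)$ trajectory $g_{t+1} = g_t \exp(\Delta t\,\xi)$ and recovers $\xi$ by averaging the increments $\log(g_t^{-1} g_{t+1})$. This closed-form average is a simple baseline written for illustration, not the paper's neural network; the helper names, the step size, and the example generator are assumptions.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def hat(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def so3_exp(w):
    """Rodrigues formula: exponential map from so(3) coordinates to SO(3)."""
    theta = math.sqrt(sum(c * c for c in w))
    k = hat(w)
    k2 = matmul(k, k)
    if theta < 1e-12:
        a, b = 1.0, 0.5
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j] for j in range(3)]
            for i in range(3)]


def so3_log(r):
    """Logarithm map for rotations with angle below pi, returned as a 3-vector."""
    c = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    s = 0.5 if theta < 1e-12 else theta / (2.0 * math.sin(theta))
    return [s * (r[2][1] - r[1][2]), s * (r[0][2] - r[2][0]), s * (r[1][0] - r[0][1])]


def simulate(g0, xi, dt, steps):
    """Sample the exponential flow at uniform time intervals."""
    step = so3_exp([dt * c for c in xi])
    poses = [g0]
    for _ in range(steps):
        poses.append(matmul(poses[-1], step))
    return poses


def recover_generator(poses, dt):
    """Average the increments log(g_t^{-1} g_{t+1}) and rescale by dt."""
    total = [0.0, 0.0, 0.0]
    for g, g_next in zip(poses, poses[1:]):
        inc = so3_log(matmul(transpose(g), g_next))
        total = [t + c for t, c in zip(total, inc)]
    n = len(poses) - 1
    return [t / (n * dt) for t in total]


xi_true = [0.3, -0.2, 0.5]                      # hypothetical generator
poses = simulate(so3_exp([0.1, 0.2, 0.3]), xi_true, dt=0.1, steps=20)
xi_est = recover_generator(poses, dt=0.1)
print(xi_est)
```

On clean data the average matches the true generator up to floating-point error; a learned regressor becomes interesting mainly when the poses are noisy.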

Authors:Thomas Daniel, Malgorzata Olejniczak, Julien Tierny
Title: BondMatcher: H-Bond Stability Analysis in Molecular Systems
Abstract:
This application paper investigates the stability of hydrogen bonds (H-bonds), as characterized by the Quantum Theory of Atoms in Molecules (QTAIM). First, we contribute a database of 4544 electron densities associated to four isomers of water hexamers (the so-called Ring, Book, Cage and Prism), generated by distorting their equilibrium geometry under various structural perturbations, modeling the natural dynamic behavior of molecular systems. Second, we present a new stability measure, called bond occurrence rate, associating each bond path present at equilibrium with its rate of occurrence within the input ensemble. We also provide an algorithm, called BondMatcher, for its automatic computation, based on a tailored, geometry-aware partial isomorphism estimation between the extremum graphs of the considered electron densities. Our new stability measure allows for the automatic identification of densities lacking H-bond paths, enabling further visual inspections. Specifically, the topological analysis enabled by our framework corroborates experimental observations and provides refined geometrical criteria for characterizing the disappearance of H-bond paths. Our electron density database and our C++ implementation are available at this address: https://github.com/thom-dani/BondMatcher.
中文摘要:本研究通过量子理论中的原子分子方法,提出键出现率指标及BondMatcher算法分析水六聚体中氢键稳定性,其拓扑分析验证了实验观测结果并为氢键路径消失提供了精确的几何判定标准。
English Summary: This study introduces a bond occurrence rate measure and BondMatcher algorithm to analyze hydrogen bond stability in water hexamers using QTAIM, with topological analysis confirming experimental observations and providing refined criteria for bond path disappearance.
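
The bond occurrence rate described above can be sketched as a simple frequency count, assuming the hard part (matching bond paths between each perturbed density and the equilibrium one, which BondMatcher does through partial isomorphism estimation) has already been done. The atom labels, the toy ensemble, and the `bond_occurrence_rate` helper are hypothetical.

```python
from collections import Counter


def bond_occurrence_rate(equilibrium_bonds, ensemble):
    """Fraction of ensemble members in which each equilibrium bond is present.

    `ensemble` is a list of sets of bonds already matched to the equilibrium
    labelling; the matching itself is outside the scope of this sketch.
    """
    if not ensemble:
        raise ValueError("ensemble must contain at least one member")
    counts = Counter()
    for bonds in ensemble:
        counts.update(b for b in equilibrium_bonds if b in bonds)
    return {b: counts[b] / len(ensemble) for b in equilibrium_bonds}


# Toy data: two H-bond paths at equilibrium, four perturbed geometries.
equilibrium = {("O1", "H7"), ("O2", "H9")}
ensemble = [
    {("O1", "H7"), ("O2", "H9")},
    {("O1", "H7")},
    {("O1", "H7"), ("O2", "H9")},
    {("O1", "H7")},
]
rates = bond_occurrence_rate(equilibrium, ensemble)
print(rates)
```

A bond with a rate well below 1 marks geometries where its bond path disappears, which are the cases worth inspecting visually.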

Authors:Xin Zhang, Robby T. Tan
Title: Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Abstract:
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
中文: 提出的MFuser框架通过基于Mamba的融合机制,有效结合视觉基础模型和视觉语言模型的互补优势,在保持线性计算复杂度的同时实现了领域泛化语义分割的卓越性能。
English: The proposed MFuser framework effectively integrates Vision Foundation Models and Vision-Language Models through Mamba-based fusion to achieve superior domain generalization in semantic segmentation with linear computational scalability.

Authors:Zeyang Zheng, Arman Hosseini, Dong Chen, Omid Shoghli, Arsalan Heydarian
Title: Real-Time Roadway Obstacle Detection for Electric Scooters Using Deep Learning and Multi-Sensor Fusion
Abstract:
The increasing adoption of electric scooters (e-scooters) in urban areas has coincided with a rise in traffic accidents and injuries, largely due to their small wheels, lack of suspension, and sensitivity to uneven surfaces. While deep learning-based object detection has been widely used to improve automobile safety, its application for e-scooter obstacle detection remains unexplored. This study introduces a novel ground obstacle detection system for e-scooters, integrating an RGB camera and a depth camera to enhance real-time road hazard detection. Additionally, the Inertial Measurement Unit (IMU) measures linear vertical acceleration to identify surface vibrations, guiding the selection of six obstacle categories: tree branches, manhole covers, potholes, pine cones, non-directional cracks, and truncated domes. All sensors, including the RGB camera, depth camera, and IMU, are integrated within the Intel RealSense Camera D435i. A deep learning model powered by YOLO detects road hazards and utilizes depth data to estimate obstacle proximity. Evaluated on a seven-hour naturalistic riding dataset, the system achieves a high mean average precision (mAP) of 0.827 and demonstrates excellent real-time performance. This approach provides an effective solution to enhance e-scooter safety through advanced computer vision and data fusion. The dataset is accessible at https://zenodo.org/records/14583718, and the project code is hosted on https://github.com/Zeyang-Zheng/Real-Time-Roadway-Obstacle-Detection-for-Electric-Scooters.
中文: 本研究为电动滑板车开发了一种集成RGB深度相机和IMU的实时地面障碍物检测系统,通过基于YOLO的深度学习模型实现了对道路危险的高精度识别。
English: This study develops a real-time ground obstacle detection system for electric scooters using an integrated RGB-depth camera and IMU, achieving high accuracy in identifying road hazards through a YOLO-based deep learning model.

Authors:Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu
Title: Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge injection during large language model (LLM) inference in recent years. However, due to their limited ability to exploit fine-grained inter-document relationships, current RAG implementations struggle to handle noisy and redundant retrieved content, which may cause errors in the generation results. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5-Turbo and GPT-4o-mini, on widely used knowledge-QA and Hallucination-Detection datasets. Experimental results show that our method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets are available at https://github.com/Tsinghua-dhy/EDC-2-RAG.
中文: 提出的EDC2-RAG框架通过动态文档聚类来优化检索增强生成技术,有效消除冗余噪声,在GPT模型上的多场景测试中均展现出稳定的性能提升。
English: The proposed EDC2-RAG framework enhances Retrieval-Augmented Generation by dynamically clustering documents to reduce noise and redundancy, demonstrating consistent performance improvements across multiple benchmarks with GPT models.

Authors:Zihan Gu, Ruoyu Chen, Hua Zhang, Yue Hu, Xiaochun Cao
Title: Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking
Abstract:
Grokking, referring to the abrupt improvement in test accuracy after extended overfitting, offers valuable insights into the mechanisms of model generalization. Existing research based on progress measures implies that grokking relies on understanding the optimization dynamics when the loss function is dominated solely by the weight decay term. However, we find that this optimization merely leads to token uniformity, which is not a sufficient condition for grokking. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations. Based on theoretical analysis and experimental validation, we present the following insights: (i) The weight decay term encourages uniformity across all tokens in the embedding space when it is minimized. (ii) The occurrence of grokking is jointly determined by the uniformity of the embedding space and the distribution of the training dataset. Building on these insights, we provide a unified perspective for understanding various previously proposed progress measures and introduce a novel, concise, and effective progress measure that can trace the changes in test loss more accurately. Finally, to demonstrate the versatility of our theoretical framework, we design a dedicated dataset to validate our theory on ResNet-18, successfully showcasing the occurrence of grokking. The code is released at https://github.com/Qihuai27/Grokking-Insight.
Chinese: 本研究揭示了Transformer在素数运算任务中顿悟现象的产生源于嵌入空间均匀性与训练数据分布的协同作用,并提出了新的进展度量指标,同时在ResNet-18上验证了理论框架的普适性。
English: This study reveals that grokking in Transformers during prime number operations arises from the combined effect of embedding space uniformity and training data distribution, leading to a new progress measure and validation on ResNet-18.

Authors:Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, Pengfei Liu
Title: DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Abstract:
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures and overcome significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.
中文: DeepResearcher是一种通过真实网络环境中的强化学习来训练大语言模型成为深度研究代理的创新框架,其性能显著超越现有方法,并展现出规划验证、自我反思等新兴认知能力。
English: DeepResearcher is a novel framework that trains large language models as deep research agents through reinforcement learning in real-world web environments, significantly outperforming existing methods and demonstrating emergent cognitive behaviors for robust information gathering.

Authors:Haozhan Tang, Tianyi Zhang, Oliver Kroemer, Matthew Johnson-Roberson, Weiming Zhi
Title: GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction
Abstract:
Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot's surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to performance in the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg adds edges to graphs and constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition, then subsequent graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code available at https://github.com/tomtang502/graphseg.git.
中文: GraphSeg框架通过构建双重对应图并利用3D基础模型,仅从稀疏的2D图像即可生成一致的3D物体分割,在机器人操作任务中实现了超越现有方法的准确性和性能表现。
English: GraphSeg is a framework that generates consistent 3D object segmentations from sparse 2D images without depth information by constructing dual correspondence graphs and leveraging 3D foundation models, achieving superior accuracy and performance in robotic manipulation tasks.
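The "edge addition, then graph contraction" step in the abstract above can be illustrated with a small union-find sketch. This is only an illustration under stated assumptions, not the authors' implementation: masks are represented by integer indices, the correspondence edges are assumed to be given already, and the name `contract_masks` is hypothetical.

```python
from collections import defaultdict


def contract_masks(num_masks, edges):
    """Group mask indices into objects by contracting every added edge.

    Hypothetical helper: `edges` is a list of (a, b) index pairs saying that
    masks a and b were judged to belong to the same object.
    """
    parent = list(range(num_masks))

    def find(i):
        # Follow parent pointers to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Contract the edge: merge the two components.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for i in range(num_masks):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Five 2D masks; masks 0, 1 and 2 are linked across views, 3 and 4 are not.
print(contract_masks(5, [(0, 1), (1, 2)]))
```

Each returned group stands for one object-level segment. How the edges are scored from 2D similarity and inferred 3D structure is left out of this sketch.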

Authors:Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, Jiantao Zhou
Title: FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge
Abstract:
The proliferation of AI-generated content raises significant forensic and security concerns, such as source tracing and copyright protection, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on the pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at https://github.com/KAHIMWONG/FontGuard.
中文摘要:FontGuard是一种创新的字体水印模型,通过修改隐藏样式特征嵌入信息,在保持字体高质量的同时显著提升了抗失真能力和解码准确率。
English Summary: FontGuard is a novel font watermarking model that embeds information by altering hidden style features, achieving better visual quality, higher decoding accuracy, and stronger robustness against real-world distortions than existing methods.

Authors:Weili Cao, Jianyou Wang, Youze Zheng, Longtian Bao, Qirui Zheng, Taylor Berg-Kirkpatrick, Ramamohan Paturi, Leon Bergen
Title: Single-Pass Document Scanning for Question Answering
Abstract:
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
中文: 提出的单遍文档扫描方法以线性时间处理整个文本,保持全局连贯性,在问答基准测试中优于分块方法,且计算成本仅为大型语言模型的一小部分。
English: The proposed single-pass document scanning method efficiently processes entire documents in linear time, preserving global coherence and outperforming chunk-based approaches on QA benchmarks at a fraction of the computational cost.

Authors:Daniel M. Cherenson, Devansh R. Agrawal, Dimitra Panagou
Title: Autonomy Architectures for Safe Planning in Unknown Environments Under Budget Constraints
Abstract:
Mission planning can often be formulated as a constrained control problem under multiple path constraints (i.e., safety constraints) and budget constraints (i.e., resource expenditure constraints). In a priori unknown environments, verifying that an offline solution will satisfy the constraints for all time can be difficult, if not impossible. Our contributions are as follows: 1) We propose an online method, building on our previous work "gatekeeper", to guarantee safety and satisfy budget constraints of the system trajectory at all times throughout a mission. 2) Next, we prove that our algorithm is recursively feasible and correct. 3) Finally, instead of using a heuristically designed backup controller, we propose a sampling-based method to construct backup trajectories that both minimize resource expenditure and reach budget renewal sets, in which path constraints are satisfied and the constrained resources are renewed. We demonstrate our approach in simulation with a fixed-wing UAV in a GNSS-denied environment with a budget constraint on localization error that can be renewed at visual landmarks.
Chinese: 本文在先前工作gatekeeper的基础上提出一种在线方法,在未知环境中保证系统轨迹在整个任务期间始终满足安全约束与预算约束,证明了算法的递归可行性与正确性,并通过基于采样的方法构造既能最小化资源消耗又能到达预算更新集的备用轨迹。
English: Building on the authors' earlier "gatekeeper" work, this paper proposes an online method that guarantees safety and budget constraints throughout a mission in unknown environments, proves the algorithm recursively feasible and correct, and uses sampling to construct backup trajectories that minimize resource expenditure and reach budget renewal sets.

Authors:Daniel M. Cherenson, Devansh R. Agrawal, Dimitra Panagou
Title: Autonomy Architectures for Safe Planning in Unknown Environments Under Budget Constraints
Abstract:
Mission planning can often be formulated as a constrained control problem under multiple path constraints (i.e., safety constraints) and budget constraints (i.e., resource expenditure constraints). In a priori unknown environments, verifying that an offline solution will satisfy the constraints for all time can be difficult, if not impossible. We present ReRoot, a novel sampling-based framework that enforces safety and budget constraints for nonlinear systems in unknown environments. The main idea is that ReRoot grows multiple reverse RRT* trees online, starting from renewal sets, i.e., sets where the budget constraints are renewed. The dynamically feasible backup trajectories guarantee safety and reduce resource expenditure, which provides a principled backup policy when integrated into the gatekeeper safety verification architecture. We demonstrate our approach in simulation with a fixed-wing UAV in a GNSS-denied environment with a budget constraint on localization error that can be renewed at visual landmarks.
Chinese: ReRoot是一种新颖的基于采样的框架,通过在未知环境中从预算可重置的更新集在线生长多个反向RRT*树,提供动态可行的备用轨迹来保证安全性并减少资源消耗。
English: ReRoot is a novel sampling-based framework that enforces safety and budget constraints for nonlinear systems in unknown environments by growing multiple reverse RRT* trees online from renewal sets, providing dynamically feasible backup trajectories to guarantee safety and reduce resource expenditure.

Authors:Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Title: VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Abstract:
In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
中文:VARGPT-v1.1作为升级版统一视觉自回归模型,通过创新的训练策略和扩展数据集,在多模态理解和文本到图像生成任务中达到最优性能,并在保持架构不变的情况下展现出图像编辑的新能力。
English: VARGPT-v1.1 is an enhanced unified visual autoregressive model that achieves state-of-the-art performance in multimodal understanding and text-to-image generation through novel training strategies and expanded datasets, while demonstrating emergent image editing capabilities without architectural changes.
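For readers unfamiliar with Direct Preference Optimization (DPO), which the abstract above names as its reinforcement learning component, the standard per-pair DPO loss can be written in a few lines. This is a generic sketch of the published DPO objective, not VARGPT-v1.1's training code. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and under
    the frozen reference model. Names and default beta are illustrative.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With no preference shift relative to the reference, the loss equals log(2).
print(dpo_pair_loss(-1.0, -1.0, -1.0, -1.0))
```

The loss falls below log(2) once the policy raises the chosen response's log-probability relative to the rejected one by more than the reference model does.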

Authors:Rohit Agarwal, Aryan Dessai, Arif Ahmed Sekh, Krishna Agarwal, Alexander Horsch, Dilip K. Prasad
Title: Haphazard Inputs as Images in Online Learning
Abstract:
The field of varying feature space in online learning settings, also known as haphazard inputs, has become prominent because of its applicability across many fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based model to be applied to haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at https://github.com/Rohit102497/HaphazardInputsAsImages.
中文: 本文提出了一种与模型无关的方法,将变化特征空间实时转换为固定维度的图像表示,使得ResNet和ViT等先进视觉模型能够处理随机输入,并在四个数据集上验证了其有效性。
English: This paper introduces a model-agnostic method that converts varying feature spaces into fixed-dimension image representations, enabling the use of advanced vision models like ResNet and ViT for haphazard inputs and demonstrating effectiveness across four datasets.

Authors:Zhihan Zhang, Yixin Cao, Lizi Liao
Title: Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement
Abstract:
Translating chart images into executable plotting scripts, referred to as the chart-to-code generation task, requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs, making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.
中文摘要:本文提出了一种双重偏好引导的优化框架,通过结合视觉与代码奖励的迭代偏好学习,显著提升了多模态大语言模型在图表转代码任务中的性能,使其达到可与专业系统相媲美的水平。
English Summary: This paper introduces a dual preference-guided refinement framework that enhances Multimodal Large Language Models for chart-to-code generation by combining visual and code rewards through iterative preference learning, significantly improving performance to rival specialized systems.

Authors:Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
Title: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Abstract:
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
中文摘要:后训练通过调整和扩展知识表征而不改变事实存储位置来增强大语言模型,同时揭示了有助于模型引导与可解释性的差异化真实性表达与拒绝机制。
English Summary: Post-training enhances large language models by adapting and developing knowledge representations without altering factual storage locations, while revealing distinct truthfulness and refusal mechanisms that aid in model steering and interpretability.
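The abstract above states that truthfulness and refusal can each be represented by a vector in hidden-representation space and that these directions can be compared between base and post-trained models. One common way to obtain such a direction is a difference of class means. The abstract does not say which estimator the authors use, so the sketch below is an assumption, and the toy activations and function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(positive, negative):
    """Direction pointing from the negative-class mean to the positive-class mean."""
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 2-D "hidden states" for true and false statements in two models.
base_direction = difference_of_means([[1.0, 0.0], [3.0, 0.0]],
                                     [[0.0, 0.0], [0.0, 2.0]])
post_direction = difference_of_means([[2.0, 0.1], [4.0, -0.1]],
                                     [[0.0, 0.0], [0.0, 2.0]])
print(base_direction, cosine_similarity(base_direction, post_direction))
```

A cosine similarity near 1 would correspond to the "highly similar" case the abstract reports for truthfulness, and a lower value to the differing refusal directions.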

Authors:Sudong Wang, Yunjian Zhang, Yao Zhu, Jianing Li, Zizhe Wang, Yanwei Liu, Xiangyang Ji
Title: Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually induces natural languages in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our code is available at https://github.com/XIAO4579/Vlm-interpretability.
中文摘要:本研究揭示了大型视觉语言模型中多模态知识的演化轨迹,通过识别关键层和突变层将演化过程划分为三个阶段,为理解其内在机制提供了新视角。
English Summary: This study investigates the evolution of multimodal knowledge in Large Vision-Language Models, identifying critical and mutation layers that divide the process into three distinct stages and offering new insights into their internal mechanisms.

Authors:Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
Title: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Abstract:
Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose a robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-4o-Image, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.
中文: 大型多模态模型在推理式视觉编辑任务中表现不佳,新推出的RISEBench基准测试显示,即便是最优模型GPT-4o-Image的准确率也仅为28.8%,充分暴露了当前模型在指令推理和外观一致性方面存在的明显缺陷。
English: Large Multi-modality Models struggle with reasoning-based visual editing tasks: on the newly introduced RISEBench benchmark, even the strongest model evaluated, GPT-4o-Image, achieves only 28.8% accuracy, highlighting significant limitations in instruction reasoning and appearance consistency.

Authors:Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
Title: Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Abstract:
Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Notably, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code is available at https://github.com/ExplainableML/sae-for-vlm.
Chinese: 稀疏自编码器(SAEs)应用于视觉语言模型(VLMs),通过增强神经元层面的单义性并直接引导模型输出,无需修改底层架构即可提升可解释性和控制能力。
English: Sparse Autoencoders (SAEs) are applied to Vision-Language Models (VLMs) to improve neuron-level monosemanticity and enable direct steering of model outputs, enhancing interpretability and control without modifying the underlying architecture.

Authors:Yuexi Du, Jiazhen Zhang, Nicha C. Dvornek, John A. Onofrey
Title: GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings
Abstract:
Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification and one segmentation datasets demonstrate that GMR-Conv not only matches conventional CNNs' performance but can also surpass it in applications with orientation-less data. GMR-Conv is also proven to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at https://github.com/XYPB/GMR-Conv.
中文摘要:GMR-Conv通过高斯混合环形卷积核实现了高效且鲁棒的旋转与反射等变性,在多个数据集上超越传统卷积网络和先进等变学习方法,且无需额外计算成本。
English Summary: GMR-Conv introduces an efficient Gaussian mixture-based convolution kernel that achieves robust rotation and reflection equivariance without computational penalties, outperforming conventional CNNs and state-of-the-art methods in various tasks.
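The idea of a kernel built from "a mixture of Gaussian-weighted rings" can be sketched as a weighted sum of Gaussians over the distance from the kernel centre. The abstract does not give the paper's parameterization or its efficient computation strategy, so the ring radii, weights, `sigma`, and the function name below are illustrative assumptions only.

```python
import math


def gaussian_ring_kernel(size, ring_radii, ring_weights, sigma=0.5):
    """Build a size x size kernel as a weighted sum of Gaussian-smoothed rings.

    Each ring contributes weight * exp(-(r - radius)^2 / (2 * sigma^2)), where
    r is the distance of a cell from the kernel centre. All parameters here
    are illustrative, not the paper's.
    """
    centre = (size - 1) / 2.0
    kernel = []
    for y in range(size):
        row = []
        for x in range(size):
            r = math.hypot(x - centre, y - centre)
            value = sum(
                weight * math.exp(-((r - radius) ** 2) / (2.0 * sigma ** 2))
                for radius, weight in zip(ring_radii, ring_weights)
            )
            row.append(value)
        kernel.append(row)
    return kernel


kernel = gaussian_ring_kernel(5, ring_radii=[0.0, 1.0, 2.0],
                              ring_weights=[1.0, 0.5, 0.25])
for row in kernel:
    print(" ".join(f"{v:.3f}" for v in row))
```

Because every entry depends only on the distance from the centre, the kernel is unchanged by 90-degree rotations and by reflections of the grid. The Gaussian smoothing is what softens the discretization of each ring onto the pixel grid.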

Authors:Jay N. Paranjape, Celso de Melo, Vishal M. Patel
Title: F-ViTA: Foundation Model Guided Visible to Thermal Translation
Abstract:
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.
中文: F-ViTA提出了一种创新的可见光到热成像转换方法,利用基础模型指导扩散过程,在多个数据集上表现卓越,并能有效泛化至分布外场景。
English: F-ViTA introduces a novel visible-to-thermal image translation method that utilizes foundation models to guide diffusion processes, achieving superior performance across multiple datasets and generalizing effectively to out-of-distribution scenarios.

Authors:Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan
Title: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Abstract:
The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
中文: 本报告推出首个评估基准GPT-ImgEval,系统评估GPT-4o在图像生成质量、编辑能力和语义合成上的卓越表现,并根据实验结果推测其采用自回归结合扩散解码头的架构,同时分析其生成局限,提供对比研究和安全分析。
English: This report introduces GPT-ImgEval, the first benchmark to evaluate GPT-4o's strong image generation, editing, and semantic synthesis capabilities. Its empirical results suggest an auto-regressive architecture with a diffusion-based decoding head, and it also analyzes the model's limitations and provides comparative and safety analyses.

Authors:Alexander Leszczynski, Sarah Gillet, Iolanda Leite, Fethiye Irmak Dogan
Title: BT-ACTION: A Test-Driven Approach for Modular Understanding of User Instruction Leveraging Behaviour Trees and LLMs
Abstract:
Natural language instructions are often abstract and complex, requiring robots to execute multiple subtasks even for seemingly simple queries. For example, when a user asks a robot to prepare avocado toast, the task involves several sequential steps. Moreover, such instructions can be ambiguous or infeasible for the robot or may exceed the robot's existing knowledge. While Large Language Models (LLMs) offer strong language reasoning capabilities to handle these challenges, effectively integrating them into robotic systems remains a key challenge. To address this, we propose BT-ACTION, a test-driven approach that combines the modular structure of Behavior Trees (BT) with LLMs to generate coherent sequences of robot actions for following complex user instructions, specifically in the context of preparing recipes in a kitchen-assistance setting. We evaluated BT-ACTION in a comprehensive user study with 45 participants, comparing its performance to direct LLM prompting. Results demonstrate that the modular design of BT-ACTION helped the robot make fewer mistakes and increased user trust, and participants showed a significant preference for the robot leveraging BT-ACTION. The code is publicly available at https://github.com/1Eggbert7/BT_LLM.
中文:BT-ACTION将行为树与大语言模型结合,为复杂指令生成连贯的机器人动作序列,在厨房任务中减少错误并提升用户信任度。
English: BT-ACTION integrates Behavior Trees with Large Language Models to generate coherent robot actions for complex instructions, reducing errors and increasing user trust in kitchen tasks.

Authors:Vincent Gbouna Zakka, Luis J. Manso, Zhuangzhuang Dai
Title: Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition
Abstract:
Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on "fixed" kernels, i.e., kernels that are applied uniformly across all neighbourhoods. This highlights the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: https://github.com/Gbouna/MAK-GCN
中文: 本研究提出了一种图卷积网络中的多头自适应核模块,能动态处理稀疏的毫米波雷达点云,在保护隐私的同时实现了最先进的人类活动识别性能。
English: This study introduces a Multi-Head Adaptive Kernel module within graph convolutional networks to dynamically process sparse mmWave radar point clouds, achieving state-of-the-art human activity recognition while preserving privacy.

Authors:Feng Gao, Miao Fu, Jingchao Cao, Junyu Dong, Qian Du
Title: Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation
Abstract:
Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, therefore providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our codes are available at https://github.com/oucailab/AFENet.
中文:提出的AFENet模型通过自适应频率空间交互模块和选择性特征融合模块,有效整合频域与空间特征以提升遥感图像语义分割精度,在多个公开数据集上验证了其优越性能。
English: The proposed AFENet model enhances semantic segmentation of remote sensing images by adaptively integrating frequency and spatial features through its AFSIM and SFM modules, demonstrating superior performance over existing methods across multiple datasets.

Authors:Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Title: Efficient Model Editing with Task-Localized Sparse Fine-tuning
Abstract:
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
中文: TaLoS提出了一种构建稀疏任务向量的方法,通过仅更新梯度敏感性低的参数子集来提升训练和推理效率,无需线性化即可在任务算术中超越现有方法。
English: TaLoS introduces a method for creating sparse task vectors that enhance training and inference efficiency by updating only a subset of parameters with low gradient sensitivity, outperforming existing approaches in task arithmetic without requiring linearization.
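The task-arithmetic operations mentioned in this abstract (task addition and negation over task vectors) can be illustrated with a minimal sketch. It assumes the usual definition of a task vector as the element-wise difference between fine-tuned and pre-trained weights. The function names, the dict-of-lists weight format, and the optional binary `mask` are hypothetical choices for illustration; how TaLoS actually selects the sparse parameter subset is not shown here.

```python
def task_vector(pretrained, finetuned, mask=None):
    """Return tau = finetuned - pretrained, optionally zeroed where mask is 0.

    Weights are plain dicts mapping a parameter name to a list of floats
    (a stand-in for real tensors). `mask` is a hypothetical binary mask.
    """
    tau = {}
    for name, w0 in pretrained.items():
        diff = [w1 - w for w, w1 in zip(w0, finetuned[name])]
        if mask is not None:
            diff = [d * m for d, m in zip(diff, mask[name])]
        tau[name] = diff
    return tau


def apply_task_vectors(pretrained, vectors, alpha=1.0):
    """Return pretrained + alpha * sum(vectors).

    alpha > 0 adds the tasks; alpha < 0 negates them.
    """
    edited = {}
    for name, w0 in pretrained.items():
        total = [sum(tv[name][i] for tv in vectors) for i in range(len(w0))]
        edited[name] = [w + alpha * t for w, t in zip(w0, total)]
    return edited


pre = {"w": [1.0, 2.0]}
tau_a = task_vector(pre, {"w": [1.5, 2.0]})   # task A changed only w[0]
tau_b = task_vector(pre, {"w": [1.0, 3.0]})   # task B changed only w[1]
added = apply_task_vectors(pre, [tau_a, tau_b])         # task addition
negated = apply_task_vectors(pre, [tau_a], alpha=-1.0)  # task negation
```

In this toy case the two task vectors touch disjoint parameters, so adding them does not cause interference; that is the intuition behind preferring sparse, well-separated updates.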

Authors:Lihua Liu, Jiehong Lin, Zhenxin Liu, Kui Jia
Title: PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation
Abstract:
RGB-based novel object pose estimation is critical for rapid deployment in robotic applications, yet zero-shot generalization remains a key challenge. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects. Code and trained models are available at https://github.com/foollh/PicoPose.
中文: PicoPose采用三阶段像素级对应学习框架,通过逐步优化特征匹配,在机器人应用中实现了针对新物体的零样本姿态估计的最先进性能。
English: PicoPose is a three-stage pixel-to-pixel correspondence learning framework that progressively refines feature matching to achieve state-of-the-art zero-shot pose estimation for novel objects in robotic applications.
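The second stage described in the abstract regresses a 2D affine transformation made up of in-plane rotation, scale, and 2D translation. As a rough illustration of what such a transform does to coordinates, the sketch below builds a 2x3 matrix from those parameters and applies it to points. The parameterization (uniform scale, rotation in radians) and the function names are assumptions for illustration, not PicoPose's code; in the paper the parameters are predicted from the coarse correspondence map.

```python
import math


def affine_from_params(theta, scale, tx, ty):
    """Build a 2x3 matrix: rotate by theta (radians), scale uniformly, translate."""
    c, s = math.cos(theta), math.sin(theta)
    return [[scale * c, -scale * s, tx],
            [scale * s, scale * c, ty]]


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Rotate template coordinates by 90 degrees, double their size, shift right by 1.
m = affine_from_params(math.pi / 2, 2.0, 1.0, 0.0)
warped = apply_affine(m, [(1.0, 0.0)])  # approximately [(1.0, 2.0)]
```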

Authors:Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Title: Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Abstract:
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
中文摘要:本研究为视觉语言模型提出了一个透明的强化学习框架,并通过实验验证了关键发现,如强化学习在泛化能力上优于监督微调,以及响应长度对随机种子的敏感性。
English Summary: This study introduces a transparent reinforcement learning framework for vision-language models, validated through experiments that reveal key insights such as RL's superior generalization over supervised fine-tuning and the sensitivity of response length to random seeds.

Authors:Andrei Dumitriu, Florin Tatui, Florin Miron, Radu Tudor Ionescu, Radu Timofte
Title: Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results
Abstract:
Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (comprising about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.
中文摘要:本文针对离岸流实例分割这一新任务,发布了详细标注的数据集,通过YOLOv8-nano模型实现了高精度检测,为后续研究奠定基础并提升海滩安全防护能力。
English Summary: This paper introduces a novel rip current instance segmentation task, presenting comprehensive datasets and achieving high detection accuracy with the YOLOv8-nano model, providing a baseline for future research and enhancing beach safety.

Authors:Hesong Li, Ziqi Wu, Ruiwen Shao, Tao Zhang, Ying Fu
Title: Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement
Abstract:
Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performance together with our network. Code will be available at https://github.com/HeasonLee/SFIN.
Chinese: 本文提出了噪声校准与数据合成方法以生成更逼真的STEM图像,并设计了一种空间-频率交互网络,利用频域信息显著提升了图像清晰度与增强效果。
English: This paper introduces noise calibration and data synthesis methods to create more realistic STEM images and develops a spatial-frequency interactive network that leverages frequency domain information for enhanced image clarity and performance.

Authors:Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang
Title: GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
Abstract:
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.
中文: 强化学习无需大量依赖监督微调即可提升大语言模型的推理能力,而提出的极简主义分组策略梯度方法通过省去复杂组件简化了训练过程,在各类任务中实现了更优性能。
English: Reinforcement Learning can enhance large language models' reasoning without heavy reliance on Supervised Fine-Tuning, and the proposed minimalist Group Policy Gradient method simplifies training by eliminating complex components, achieving superior performance across tasks.
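To make the idea of a critic-free, reference-free policy gradient over a group of sampled responses more concrete, here is a minimal sketch. It assumes a plain REINFORCE-style objective with the group-mean reward as baseline, and it omits the advantage and gradient bias corrections that the abstract attributes to GPG. The function name and the scalar per-response log-probabilities are hypothetical simplifications, so this should be read as the general family of estimator rather than the paper's exact loss.

```python
def group_policy_gradient_loss(log_probs, rewards):
    """Policy-gradient loss for one group of sampled responses.

    log_probs: summed log-probability of each sampled response under the policy.
    rewards:   scalar reward for each response.
    The baseline is the group-mean reward, so no critic, reference model,
    or KL term is involved.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and equal length")
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Minimizing this loss raises the log-probability of above-average responses.
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)


loss = group_policy_gradient_loss(
    log_probs=[-1.0, -2.0, -3.0, -4.0],
    rewards=[1.0, 0.0, 1.0, 0.0],
)
```

In a real training loop the log-probabilities would come from an autodiff framework, and the gradient of this scalar would be backpropagated through the policy.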

Authors:Fatemeh Behrad, Tinne Tuytelaars, Johan Wagemans
Title: Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
Abstract:
The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1 %) using a lightweight ViT backbone. Code and pre-trained models are available at https://github.com/FBehrad/Charm.
中文摘要:Charm提出了一种新颖的标记化方法,能同时保留图像的构图、高分辨率、宽高比和多尺度信息,使视觉变换器能够在不损失关键信息的前提下高效处理不同尺寸的输入,从而显著提升图像美学评估任务的性能表现。
English Summary: Charm introduces a novel tokenization method that preserves key image attributes like composition and high-resolution details, enabling Vision transformers to process variable-sized inputs efficiently without information loss, thereby significantly improving performance in image aesthetic assessment tasks.

Authors:Nedko Savov, Naser Kazemi, Mohammad Mahdi, Danda Pani Paudel, Xi Wang, Luc Van Gool
Title: Exploration-Driven Generative Interactive Environments
Abstract:
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates the ability to simulate many environments with shared behavior. Unfortunately, training that model requires expensive demonstrations. Therefore, we propose a training framework that uses only a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that relies entirely on the uncertainty of the world model, delivering diverse data from which the model can learn best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, improving video fidelity and controllability. To automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.
Chinese: 该研究提出了一种在虚拟环境中使用随机代理的训练框架,并引入了AutoExplore代理,通过利用世界模型的不确定性生成多样化数据,以提升在新环境中的适应性和性能,同时基于RetroAct数据集和增强的GenieRedux-G模型实现。
English: The study introduces a training framework using a random agent in virtual environments and proposes the AutoExplore Agent, which leverages world model uncertainty to generate diverse data for improved adaptability and performance in new environments, supported by the RetroAct dataset and an enhanced GenieRedux-G model.

Authors:Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, Yunhong Wang
Title: APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
Abstract:
Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.
Chinese: 视觉Transformer(ViT)在超低位量化时存在显著精度下降问题,APHQ-ViT通过改进的基于Hessian的重要性评估和MLP重构方法,在3位和4位量化设置中实现了最先进的性能表现。
English: Vision Transformers (ViTs) face significant accuracy loss during ultra-low bit quantization, which APHQ-ViT addresses through improved Hessian-based importance estimation and MLP reconstruction to achieve state-of-the-art performance in 3-bit and 4-bit settings.

Authors:Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
Title: ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Abstract:
Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
中文摘要:ZClip是一种自适应梯度裁剪算法,通过基于z分数的异常检测动态调整阈值,有效防止大语言模型训练中的梯度不稳定和损失峰值问题,同时不影响模型收敛。
English Summary: ZClip is an adaptive gradient clipping algorithm that dynamically adjusts thresholds using z-score-based anomaly detection to prevent gradient instability and loss spikes in large language model training without hindering convergence.
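The z-score idea in this abstract can be sketched in a few lines: keep running statistics of the gradient norm, and when a new norm lies too many standard deviations above the running mean, scale the gradients down to the threshold. The sketch below is a toy illustration, not the ZClip reference implementation. The class name, the exponential-moving-average update, and the default values of `alpha`, `z_thresh`, and `warmup` are all assumptions made for this example.

```python
import math


class ZScoreClipper:
    """Toy z-score gradient-norm clipper (illustrative; not the ZClip code)."""

    def __init__(self, alpha=0.97, z_thresh=2.5, warmup=5):
        self.alpha = alpha        # EMA decay for mean/variance (assumed value)
        self.z_thresh = z_thresh  # z-score above which a norm is a spike (assumed)
        self.warmup = warmup      # steps used to initialise statistics (assumed)
        self.history = []
        self.mean = None
        self.var = None

    def clip_coefficient(self, grad_norm):
        """Return the factor in (0, 1] by which gradients should be scaled."""
        if self.mean is None:
            # Warm-up: only collect norms, never clip.
            self.history.append(grad_norm)
            if len(self.history) >= self.warmup:
                n = len(self.history)
                self.mean = sum(self.history) / n
                self.var = sum((g - self.mean) ** 2 for g in self.history) / n
            return 1.0
        std = math.sqrt(self.var) + 1e-12
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:
            # Spike: shrink the norm down to mean + z_thresh * std.
            effective = self.mean + self.z_thresh * std
            coef = effective / grad_norm
        else:
            effective = grad_norm
            coef = 1.0
        # Update statistics with the clipped norm so spikes do not inflate them.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * effective
        self.var = self.alpha * self.var + (1 - self.alpha) * (effective - self.mean) ** 2
        return coef


clipper = ZScoreClipper()
coefs = [clipper.clip_coefficient(g) for g in [1.0, 1.1, 0.9, 1.0, 1.0, 10.0, 1.0]]
```

In a training loop, the returned coefficient would multiply every gradient before the optimizer step; ordinary steps get a coefficient of 1.0 and are left untouched.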

Authors:Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu
Title: Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Abstract:
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
中文: 本综述系统梳理了多模态融合在机器人视觉中的应用,对比视觉语言模型与传统方法,分析数据集并指出关键研究挑战,以推动多模态感知技术的发展。
English: This survey systematically reviews multimodal fusion applications in robotic vision, comparing vision-language models with traditional methods while analyzing datasets and identifying key research challenges to advance multimodal perception.

Authors:Rick van Essen, Eldert van Henten, Lammert Kooistra, Gert Kootstra
Title: Adaptive path planning for efficient object search by UAVs in agricultural fields
Abstract:
This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, and investigated the effects of localization errors and of different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field. When the objects were uniformly distributed, more low-altitude inspections were needed compared to a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with a high number of objects. Overall, the presented adaptive path planner allowed finding non-uniformly distributed objects in a field faster than a coverage path planner and resulted in a comparable detection accuracy. The path planner is made available at https://github.com/wur-abe/uav_adaptive_planner.
中文: 本文提出了一种用于农田目标搜索的自适应无人机路径规划器,通过结合高空覆盖飞行和检测不确定时的低空巡检,在非均匀分布目标场景下展现出比全覆盖路径更短的飞行距离,同时对定位误差具有良好鲁棒性。
English: This paper introduces an adaptive UAV path planner for agricultural object search that combines high-altitude coverage with targeted low-altitude inspections triggered by detection uncertainty, demonstrating superior efficiency for non-uniform object distributions while maintaining robust performance against localization errors.

Authors:Vladimir Slaykovskiy, Maksim Zvegintsev, Yury Sakhonchyk, Hrachik Ajamian
Title: Evaluating AI Recruitment Sourcing Tools by Human Preference
Abstract:
This study introduces a benchmarking methodology designed to evaluate the performance of AI-driven recruitment sourcing tools. We created and utilized a dataset to perform a comparative analysis of search results generated by leading AI-based solutions, LinkedIn Recruiter, and our proprietary system, Pearch.ai. Human experts assessed the relevance of the returned candidates, and an Elo rating system was applied to quantitatively measure each tool's comparative performance. Our findings indicate that AI-driven recruitment sourcing tools consistently outperform LinkedIn Recruiter in candidate relevance, with Pearch.ai achieving the highest performance scores. Furthermore, we found a strong alignment between AI-based evaluations and human judgments, highlighting the potential for advanced AI technologies to substantially enhance talent acquisition effectiveness. Code and supporting data are publicly available at https://github.com/vslaykovsky/ai-sourcing-benchmark
中文摘要:本研究对AI驱动的招聘寻源工具进行基准测试,发现其在候选人相关性方面优于LinkedIn Recruiter,其中Pearch.ai表现最佳,同时证实AI评估与人工判断高度一致,可显著提升人才获取效能。
English Summary: This study benchmarks AI-driven recruitment sourcing tools, finding they outperform LinkedIn Recruiter in candidate relevance with Pearch.ai achieving top performance, while demonstrating strong alignment between AI and human evaluations to enhance talent acquisition.
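The abstract states that an Elo rating system was applied to pairwise human judgments. For readers unfamiliar with Elo, the standard update rule is sketched below. The K-factor of 32 and the starting rating of 1500 are common defaults assumed here for illustration; the abstract does not state which values the study used.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two tools start level; a human judge prefers tool A's candidate list.
tool_a, tool_b = elo_update(1500.0, 1500.0, score_a=1.0)  # → (1516.0, 1484.0)
```

Repeating this update over many judged pairs yields a ranking in which the rating gap between two tools reflects how often one is preferred over the other.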

Authors:Changshuo Wang, Shuting He, Xiang Fang, Meiqing Wu, Siew-Kei Lam, Prayag Tiwari
Title: Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation
Abstract:
Few-shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.
中文: 提出的TaylorSeg网络采用无需预训练的方法,通过TaylorConv将不规则点云的局部结构建模为多项式拟合,在少样本点云语义分割中实现了领先性能,避免了预训练的时间开销。
English: The proposed TaylorSeg network introduces a pretraining-free approach for few-shot point cloud semantic segmentation, utilizing TaylorConv to model local structures as polynomial fittings, achieving state-of-the-art performance without pretraining overhead.

Authors:Jiayi Gao, Zijin Yin, Changcheng Hua, Yuxin Peng, Kongming Liang, Zhanyu Ma, Jun Guo, Yang Liu
Title: ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
Abstract:
The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce \textbf{ConMo}, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.
中文: ConMo是一种零样本框架,能从源视频中分离并重组主体和相机运动,实现对多样化主体的精确运动控制,并在多主体场景中提升视频生成的保真度和语义一致性。
English: ConMo is a zero-shot framework that disentangles and recomposes subject and camera motions from source videos, enabling precise motion control across diverse subjects and improving multi-subject video generation with enhanced fidelity and semantic consistency.

Authors:Jingyi Wang, Duanfeng Chu, Zejian Deng, Liping Lu, Jinxiang Wang, Chen Sun
Title: CHARMS: A Cognitive Hierarchical Agent for Reasoning and Motion Stylization in Autonomous Driving
Abstract:
To address the challenge of insufficient interactivity and behavioral diversity in autonomous driving decision-making, this paper proposes a Cognitive Hierarchical Agent for Reasoning and Motion Stylization (CHARMS). By leveraging Level-k game theory, CHARMS captures human-like reasoning patterns through a two-stage training pipeline comprising reinforcement learning pretraining and supervised fine-tuning. This enables the resulting models to exhibit diverse and human-like behaviors, enhancing their decision-making capacity and interaction fidelity in complex traffic environments. Building upon this capability, we further develop a scenario generation framework that utilizes the Poisson cognitive hierarchy theory to control the distribution of vehicles with different driving styles through Poisson and binomial sampling. Experimental results demonstrate that CHARMS is capable of both making intelligent driving decisions as an ego vehicle and generating diverse, realistic driving scenarios as environment vehicles. The code for CHARMS is released at https://github.com/chuduanfeng/CHARMS.
中文摘要:本文提出CHARMS认知分层代理,利用Level-k博弈论通过两阶段训练实现类人推理与多样化行为,实验证明其既能作为主车做出智能驾驶决策,也能生成多样真实的环境车辆驾驶场景。
English Summary: This paper introduces CHARMS, a cognitive hierarchical agent that uses Level-k game theory to enhance autonomous driving decision-making with human-like reasoning and diverse behaviors, validated through experiments for both intelligent driving decisions and realistic scenario generation.
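
The Poisson cognitive hierarchy idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard Poisson cognitive hierarchy form, in which the share of level-k reasoners is proportional to e^{-tau} tau^k / k!, truncated at a maximum level and renormalized. The function names, the truncation, and the parameter values are illustrative assumptions and are not taken from the CHARMS code; the binomial sampling step from the abstract is not modeled here.

```python
import math
import random


def poisson_level_probs(tau, max_level):
    """Truncated, renormalized Poisson(tau) weights over levels 0..max_level."""
    weights = [
        math.exp(-tau) * tau ** k / math.factorial(k)
        for k in range(max_level + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_levels(tau, max_level, n_vehicles, seed=0):
    """Draw one reasoning level per environment vehicle (hypothetical helper)."""
    rng = random.Random(seed)
    probs = poisson_level_probs(tau, max_level)
    return rng.choices(range(max_level + 1), weights=probs, k=n_vehicles)


if __name__ == "__main__":
    print(poisson_level_probs(tau=1.5, max_level=3))
    print(sample_levels(tau=1.5, max_level=3, n_vehicles=10))
```

A larger tau shifts probability mass toward higher reasoning levels, which is one way a scenario generator could bias the mix of driving styles.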

Authors:Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Title: Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Abstract:
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.
Chinese: ViLAMP通过差分蒸馏方法,在关键帧中保留完整信息并压缩非关键帧特征,实现了在单个GPU上高效处理超长视频的同时保持顶尖性能。
English: ViLAMP introduces differential distillation to efficiently process long videos by preserving key information in keyframes and compressing non-keyframes, achieving state-of-the-art performance with high computational efficiency on a single GPU.
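
As a rough illustration of selecting keyframes that are relevant to the query yet distinct from one another, the sketch below uses a simple greedy rule: rank frames by cosine similarity to the query, then skip any frame that is too similar to one already chosen. This is an assumed stand-in for the general idea, not ViLAMP's actual selection procedure; `select_keyframes`, `max_similarity`, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_keyframes(frames, query, max_keyframes, max_similarity=0.9):
    """Greedy pick: most query-relevant frames first, skipping near-duplicates."""
    order = sorted(
        range(len(frames)),
        key=lambda i: cosine(frames[i], query),
        reverse=True,
    )
    selected = []
    for i in order:
        if len(selected) == max_keyframes:
            break
        # Keep the frame only if it is distinct from every chosen keyframe.
        if all(cosine(frames[i], frames[j]) <= max_similarity for j in selected):
            selected.append(i)
    return sorted(selected)  # return indices in temporal order


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
    query = [1.0, 0.2]
    print(select_keyframes(frames, query, max_keyframes=2))
```

In this toy input, frames 0 and 1 are near-duplicates, so only the more relevant of the two is kept and the second slot goes to a visually different frame.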

Authors:Mario Kahlhofer, Matteo Golinelli, Stefan Rass
Title: Koney: A Cyber Deception Orchestration Framework for Kubernetes
Abstract:
System operators responsible for protecting software applications remain hesitant to implement cyber deception technology, including methods that place traps to catch attackers, despite its proven benefits. Overcoming their concerns removes a barrier that currently hinders industry adoption of deception technology. Our work introduces deception policy documents to describe deception technology "as code" and pairs them with Koney, a Kubernetes operator, which facilitates the setup, rotation, monitoring, and removal of traps in Kubernetes. We leverage cloud-native technologies, such as service meshes and eBPF, to automatically add traps to containerized software applications, without having access to the source code. We focus specifically on operational properties, such as maintainability, scalability, and simplicity, which we consider essential to accelerate the adoption of cyber deception technology and to facilitate further research on cyber deception.
中文: 本研究提出欺骗策略文档和Koney Kubernetes操作器,利用云原生技术自动在容器化应用中部署和管理网络欺骗陷阱,通过解决可维护性、可扩展性等运维问题来推动该技术的行业应用。
English: This work introduces deception policy documents and the Koney Kubernetes operator to automate the deployment and management of cyber deception traps in containerized applications using cloud-native technologies, addressing operational concerns to promote industry adoption.

Authors:Peifu Liu, Huiyan Bai, Tingfa Xu, Jihui Wang, Huan Chen, Jianan Li
Title: Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline
Abstract:
The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Additionally, the High-resolution Fusion Module combines bottom-up fusion strategy and learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.
中文: 本文提出了首个高光谱遥感图像显著目标检测数据集HRSSD,并开发了具有跨层级显著性评估模块和高分辨率融合模块的深度光谱显著性网络DSSN,有效解决了尺度变化和复杂背景等挑战,在多个数据集上验证了其优越性能和泛化能力。
English: This paper introduces the first hyperspectral remote sensing image salient object detection dataset HRSSD and proposes the Deep Spectral Saliency Network (DSSN) with innovative modules to address challenges like scale variation and cluttered backgrounds, demonstrating superior performance and generalizability across multiple datasets.

Authors:Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang
Title: AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology
Abstract:
The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). Additionally, we also investigate how the test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at https://github.com/MiliLab/AnesBench.
中文: AnesBench是一个跨语言基准,从事实检索(System 1)、混合推理(System 1.x)和复杂决策(System 2)三个层次评估大语言模型的麻醉学推理能力,并分析了模型规模、思维链长度、语言迁移性、训练策略及测试时推理技术对性能的影响。
English: AnesBench is a cross-lingual benchmark that evaluates LLM reasoning in anesthesiology across three levels (factual retrieval, hybrid reasoning, and complex decision-making), and the study analyzes how model scale, Chain of Thought length, language transferability, training strategies (CPT and SFT), and test-time reasoning techniques affect performance.

Authors:Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang
Title: AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
Abstract:
The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus demonstrates substantial performance improvements, rivaling the performance of larger-scale models. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.
中文: AnesSuite是首个专为麻醉学推理设计的综合数据集套件,通过评估基准和训练数据开发了基线模型Morpheus,该模型在有限训练下实现了显著性能提升,媲美更大规模模型。
English: AnesSuite is introduced as the first comprehensive dataset suite for evaluating and training large language models in anesthesiology reasoning, leading to the development of Morpheus, a baseline model that shows significant performance improvements despite limited training.

Authors:Takahiro Shirakawa, Tomoyuki Suzuki, Takuto Narumoto, Daichi Haraguchi
Title: MG-Gen: Single Image to Motion Graphics Generation
Abstract:
We introduce MG-Gen, a framework that generates motion graphics directly from a single raster image. MG-Gen decomposes a single raster image into layered structures represented as HTML, generates animation scripts for each layer, and then renders them into a video. Experiments confirm that MG-Gen generates dynamic motion graphics while preserving text readability and fidelity to the input conditions, whereas state-of-the-art image-to-video generation methods struggle with both. The code is available at https://github.com/CyberAgentAILab/MG-GEN.
中文:MG-Gen框架能够将单个栅格图像分解为分层HTML结构并生成动画脚本,从而直接创建动态图形,在保持文本可读性和输入保真度方面优于现有方法。
English: MG-Gen is a framework that converts a single raster image into layered HTML structures and animation scripts to produce dynamic motion graphics, outperforming existing methods in preserving text and input fidelity.

Authors:Boris Sukhovilov
Title: Determining Sphere Radius through Pairwise Distances
Abstract:
We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary N number of points. We provide a new closed form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at https://github.com/boris-sukhovilov/Sphere_Radius.
中文: 本文提出了一种基于点间距离测量估算球面半径的新方法,考虑了测量误差和球体形状偏差,并给出了最小化估计方差的最优点配置及开源代码实现。
English: This paper introduces a novel method for estimating the radius of a spherical surface using pairwise distance measurements, accounting for measurement errors and shape deviations, and provides an optimal point configuration for minimal estimation variance with open-source implementation.
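
For the minimal four-point case, the link between pairwise distances and the radius can be sketched with a classical identity rather than the paper's own derivation: if D is the 4x4 matrix of squared distances between non-coplanar points on a sphere of radius R, then 1^T D^{-1} 1 = 1 / (2 R^2). The code below is an illustrative, noise-free sketch under that assumption; it does not reproduce the author's closed-form solution for N points or the error analysis, and the function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: points may be coplanar")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def sphere_radius(dist):
    """Radius from a 4x4 matrix of pairwise distances between points on a sphere."""
    n = len(dist)
    squared = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    u = solve_linear(squared, [1.0] * n)  # u = D^{-1} 1
    return math.sqrt(1.0 / (2.0 * sum(u)))


if __name__ == "__main__":
    # Four non-coplanar points on a sphere of radius 2 centered at (1, -1, 0.5).
    pts = [(3.0, -1.0, 0.5), (1.0, 1.0, 0.5), (1.0, -1.0, 2.5), (-1.0, -1.0, 0.5)]
    dist = [[math.dist(p, q) for q in pts] for p in pts]
    print(sphere_radius(dist))
```

Note that only distances enter the computation: neither the point coordinates nor the sphere center is needed, which is the property the abstract relies on.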

Authors:Ye Su, Hezhe Qiao, Di Wu, Yuwen Chen, Lin Chen
Title: Temporal Gaussian Copula For Clinical Multivariate Time Series Data Imputation
Abstract:
The imputation of the Multivariate time series (MTS) is particularly challenging since the MTS typically contains irregular patterns of missing values due to various factors such as instrument failures, interference from irrelevant data, and privacy regulations. Existing statistical methods and deep learning methods have shown promising results in time series imputation. In this paper, we propose a Temporal Gaussian Copula Model (TGC) for three-order MTS imputation. The key idea is to leverage the Gaussian Copula to explore the cross-variable and temporal relationships based on the latent Gaussian representation. Subsequently, we employ an Expectation-Maximization (EM) algorithm to improve robustness in managing data with varying missing rates. Comprehensive experiments were conducted on three real-world MTS datasets. The results demonstrate that our TGC substantially outperforms the state-of-the-art imputation methods. Additionally, the TGC model exhibits stronger robustness to the varying missing ratios in the test dataset. Our code is available at https://github.com/MVL-Lab/TGC-MTS.
Chinese: 本文提出了一种时序高斯Copula模型(TGC),通过捕捉跨变量和时间依赖性来有效填补多元时间序列中的缺失值,实验证明该模型在性能和鲁棒性上均优于现有方法。
English: The paper introduces a Temporal Gaussian Copula Model (TGC) that effectively imputes missing values in multivariate time series by capturing cross-variable and temporal dependencies, demonstrating superior performance and robustness in experiments compared to existing methods.

Authors:Minheng Ni, Ennan Wu, Zidong Gong, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Lijuan Wang, Wangmeng Zuo
Title: Measurement of LLM's Philosophies of Human Nature
Abstract:
The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design the standardized psychological scale specifically targeting large language models (LLM), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLM, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.
中文摘要:本研究基于人类本性哲学量表设计了针对大语言模型的M-PHNS评估体系,发现当前大语言模型普遍存在对人类系统性不信任的现象,且模型智能水平与人类信任度呈负相关,同时提出的心理循环学习框架通过道德场景交互有效提升了模型对人类本性的信任态度。
English Summary: This study introduces the Machine-based Philosophies of Human Nature Scale (M-PHNS), revealing that current large language models systematically distrust humans and showing that higher intelligence correlates with lower trust, while proposing a mental loop learning framework that significantly improves their trust through ethical scenario interactions.

Authors:Xinyu Luo, Kecheng Chen, Pao-Sheng Vincent Sun, Chris Xing Tian, Arindam Basu, Haoliang Li
Title: SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in terms of energy efficiency, temporal processing, and biological plausibility. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets. Furthermore, SPACE exhibits robust generalization across diverse network architectures, consistently enhancing the performance of SNNs on CNNs, Transformer, and ConvLSTM architectures. Experimental results show that SPACE outperforms state-of-the-art ANN methods while maintaining lower computational cost, highlighting its effectiveness and robustness for SNNs in real-world settings. The code will be available at https://github.com/ethanxyluo/SPACE.
Chinese: 提出的SPACE方法是首个专为脉冲神经网络设计的无源单实例测试时自适应方法,利用脉冲动态特性增强对分布偏移的鲁棒性,在多种网络架构上优于现有方法且计算成本更低。
English: The proposed SPACE method is a novel source-free and single-instance test-time adaptation approach designed specifically for Spiking Neural Networks, leveraging their spike dynamics to enhance robustness against distribution shifts and outperforming existing methods across various architectures with lower computational cost.

Authors:Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide
Title: MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Abstract:
Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area distributed settings, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this paper, we introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments, and also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the proposed method integrates a human detection module to enhance spatial feature learning, guiding the model to prioritize frames with human activity to improve action recognition accuracy. Experiments on the proposed MultiSensor-Home and the existing MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. Quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition. The source code is available at https://github.com/thanhhff/MultiTSF.
中文: 本文提出了针对多模态动作识别实际挑战的MultiSensor-Home数据集和基于Transformer的MultiTSF方法,该方法能动态建模视图间关系并增强空间特征学习,实验证明其性能优于现有最优方法。
English: This paper introduces the MultiSensor-Home dataset addressing real-world challenges in multi-modal action recognition and proposes MultiTSF, a Transformer-based method that dynamically models inter-view relationships and enhances spatial feature learning, demonstrating superior performance over existing approaches.

Authors:Amit Rand, Hadi Ibrahim
Title: Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation
Abstract:
Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to an approximately 233% relative improvement over random guessing (AUC = 0.5).
中文: MXA模块作为一种新型注意力机制,集成于EfficientViT架构中,通过捕捉局部细节和全局上下文,显著提升了多标签胸部X光分类性能,在CheXpert数据集上实现了0.85的AUC值。
English: The MXA block, a novel attention mechanism integrated into the EfficientViT architecture, significantly enhances multi-label chest X-ray classification by capturing both local details and global context, achieving a 0.85 AUC on the CheXpert dataset.

Authors:Shaocong Long, Qianyu Zhou, Xiangtai Li, Chenhao Ying, Yunhai Tong, Lizhuang Ma, Yuan Luo, Dacheng Tao
Title: Generative Classifier for Domain Generalization
Abstract:
Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance; however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
中文: 领域泛化旨在提升模型对分布变化的适应性,本研究提出GCDG方法,采用高斯混合模型的生成式分类器,通过异质性学习、伪相关阻断和组件平衡模块,有效捕捉领域特定信息并提升泛化性能。
English: Domain generalization (DG) aims to enhance model adaptability to distribution shifts, and this study introduces GCDG, a generative classifier using Gaussian Mixture Models to effectively capture domain-specific information and improve generalization through modules for heterogeneity learning, spurious correlation blocking, and component balancing.

Authors:Wenzhuo Liu, Wenshuo Wang, Yicheng Qiao, Qiannan Guo, Jiayin Zhu, Pengfei Li, Zilong Chen, Huiming Yang, Zhiwei Li, Lening Wang, Tiao Tan, Huaping Liu
Title: MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
Abstract:
Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.
中文: 本文提出了MMTL-UniAD框架,通过多轴注意力机制和双分支嵌入设计,在联合学习驾驶员行为、情绪、车辆行为与交通环境的同时有效抑制任务间负迁移,在AIDE数据集上实现了最优性能。
English: This paper introduces MMTL-UniAD, a unified multi-modal multi-task learning framework that jointly recognizes driver behavior, emotion, vehicle behavior, and traffic context while mitigating negative transfer through multi-axis attention and dual-branch embedding, achieving superior performance on the AIDE dataset.

Authors:Tae-Young Lee, Sundong Park, Minwoo Jeon, Hyoseok Hwang, Gyeong-Moon Park
Title: ESC: Erasing Space Concept for Knowledge Deletion
Abstract:
As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods are the fastest and achieve state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at http://github.com/KU-VGI/ESC.
Chinese: 本文提出知识删除(KD)概念,通过ESC和ESC-T方法在特征空间中彻底消除个人知识,在多种场景下实现了最先进的安全高效知识遗忘性能。
English: This paper introduces Knowledge Deletion (KD) to address privacy concerns by completely erasing personal knowledge from models, proposing the ESC and ESC-T methods that achieve state-of-the-art performance in secure and efficient knowledge removal across various scenarios.

Authors:Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He
Title: Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Abstract:
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
Chinese: 本研究提出了一种数据合成流程和UNO模型,以解决主题驱动图像生成中的数据扩展和主题扩展难题,在单主题与多主题生成中均实现了高度一致性和可控性。
English: This study introduces a data synthesis pipeline and the UNO model to overcome challenges in subject-driven image generation, achieving high consistency and controllability in both single- and multi-subject scenarios.

Authors:Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Lars Schimmelpfennig, Levi Kaster, Di Huang, Carlos Cruchaga, Guangfu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li
Title: OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling
Abstract:
Complex cell signaling systems -- governed by varying protein abundances and interactions -- generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, hundreds of millions of single-cell omics data have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations -- such as biological functions, cellular locations, signaling pathways, related diseases, and drugs -- with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
中文: OmniCellTOSG推出了首个细胞文本-组学信号图谱数据集,将人类可读的生物注释与定量基因数据相结合,通过图推理解码不同条件下的复杂细胞信号系统。
English: OmniCellTOSG introduces the first dataset of cell text-omic signaling graphs that integrates human-readable biological annotations with quantitative gene data, enabling graph reasoning to decode complex cell signaling systems across diverse conditions.

Authors:Georgios Hadjiantonis, Sarah Gillet, Marynel Vázquez, Iolanda Leite, Fethiye Irmak Dogan
Title: Let's move on: Topic Change in Robot-Facilitated Group Discussions
Abstract:
Robot-moderated group discussions have the potential to facilitate engaging and productive interactions among human participants. Previous work on topic management in conversational agents has predominantly focused on human engagement and topic personalization, with the agent having an active role in the discussion. Also, studies have shown the usefulness of including robots in groups, yet further exploration is still needed for robots to learn when to change the topic while facilitating discussions. Accordingly, our work investigates the suitability of machine-learning models and audiovisual non-verbal features in predicting appropriate topic changes. We utilized interactions between a robot moderator and human participants, which we annotated and used for extracting acoustic and body language-related features. We provide a detailed analysis of the performance of machine learning approaches using sequential and non-sequential data with different sets of features. The results indicate promising performance in classifying inappropriate topic changes, outperforming rule-based approaches. Additionally, acoustic features exhibited comparable performance and robustness compared to the complete set of multimodal features. Our annotated data is publicly available at https://github.com/ghadj/topic-change-robot-discussions-data-2024.
中文: 本研究探索了利用机器学习模型结合视听非语言特征来预测机器人主持的群体讨论中恰当的话题转换时机,结果表明其性能优于基于规则的方法,并指出仅用声学特征即可达到与完整多模态特征相当的性能和鲁棒性。
English: This study explores machine learning models with audiovisual non-verbal features for predicting appropriate topic changes in robot-moderated group discussions, outperforming rule-based approaches and showing that acoustic features alone perform comparably to the complete multimodal feature set.

Authors:Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri
Title: TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Abstract:
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
中文: 大型语言模型可通过自回归元调度与固定比例的旧数据回放实现高效更新,在通用网络数据上以约2.6倍更少的计算量达到与从头重新训练相当的留出损失;回放在通用网络数据上对防止遗忘至关重要,而在特定领域则不那么重要。
English: Large Language Models (LLMs) can be efficiently updated using autoregressive meta-schedules with fixed-ratio replay of older data, achieving held-out loss comparable to retraining from scratch on general Common Crawl data with 2.6x less computation, though replay matters more for avoiding forgetting on generic web data than on specific domains.

Authors:Zhonghang Li, Lianghao Xia, Xubin Ren, Jiabin Tang, Tianyi Chen, Yong Xu, Chao Huang
Title: Urban Computing in the Era of Large Language Models
Abstract:
Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs' functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.
中文: 城市计算利用数据驱动技术应对城市挑战,本综述探讨了大型语言模型(LLMs)如何在交通、环境监测等领域提升数据处理、决策支持和公众参与,同时提出了未来发展方向和相关工具。
English: Urban computing leverages data-driven technologies to tackle urban challenges, and this survey explores how Large Language Models (LLMs) can enhance data processing, decision-making, and citizen engagement across domains like transportation and environmental monitoring, while proposing future directions and tools for advancement.

Authors:Ilir Tahiraj, Markus Edinger, Dominik Kulmer, Markus Lienkamp
Title: CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups
Abstract:
In autonomous systems, sensor calibration is essential for safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping fields of view and does not require any external sensing devices. First, we apply motion to produce field of view overlaps and utilize a simple Unscented Kalman Filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open-source and available on https://github.com/TUMFTM/CaLiV.
中文:CaLiV算法实现了多激光雷达系统的外参标定,无需重叠视场或外部设备即可完成传感器间及传感器与车辆间的校准,通过基于运动的点云配准方法在平移和旋转参数上均获得高精度结果。
English: The CaLiV algorithm enables extrinsic calibration of multi-LiDAR systems for both sensor-to-sensor and sensor-to-vehicle parameters without requiring overlapping fields of view or external devices, achieving high accuracy in both translation and rotation through motion-based point cloud alignment.

Authors:Oliver Hahn, Christoph Reich, Nikita Araslanov, Daniel Cremers, Christian Rupprecht, Stefan Roth
Title: Scene-Centric Unsupervised Panoptic Segmentation
Abstract:
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.
Chinese: 本研究首次提出一种无监督全景分割方法,直接基于场景中心图像进行训练,无需依赖物体中心数据或人工标注,通过伪标签训练和全景自训练策略,显著提升了全景分割质量。
English: This study introduces the first unsupervised panoptic segmentation method that trains directly on scene-centric imagery, eliminating the need for object-centric data or human annotations by leveraging pseudo-label training and panoptic self-training to achieve significant improvements in panoptic quality.

Authors:Jing Liu, Wenxuan Wang, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
Title: Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Abstract:
Referring expression segmentation (RES) aims at segmenting the entities' masks that match the descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single object or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards visual granularity unified RES task. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at https://github.com/Rubics-Xuan/MRES.
中文: 本文提出了多粒度指代表达分割(MRES)任务及RefCOCOm基准和MRES-32M数据集,开发了统一多模态模型UniRES++,在多个RES基准测试中实现了最优性能。
English: This paper introduces a multi-granularity referring expression segmentation (MRES) task with the new RefCOCOm benchmark and MRES-32M dataset, proposing UniRES++, a unified multimodal model that achieves state-of-the-art performance across various RES benchmarks.

Authors:Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, Zhaoxiang Zhang
Title: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Abstract:
End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.
中文: WoTE框架采用BEV世界模型预测未来BEV状态以进行自动驾驶轨迹评估,其延迟低于图像级世界模型,并在NAVSIM和Bench2Drive基准测试中实现了最优性能。
English: The WoTE framework uses a BEV world model to predict future BEV states for online trajectory evaluation in end-to-end autonomous driving, with lower latency than image-level world models, and achieves state-of-the-art performance on the NAVSIM and Bench2Drive benchmarks.

Authors:Boshi Wang, Huan Sun
Title: Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Abstract:
Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.
Chinese: 大语言模型中的逆转诅咒源于Transformer在概念绑定上的局限,而基于JEPA的新型模型设计结合记忆层不仅突破了这一诅咒,还通过参数化前向链实现了卓越的算术推理能力。
English: The Reversal Curse in LLMs arises from transformers' limitations in conceptual binding, and a novel JEPA-based model design with memory layers overcomes this curse, enabling superior arithmetic reasoning through parametric forward-chaining.

Authors:Andrey Sidorenko, Michael Platzer, Mario Scriminaci, Paul Tiwald
Title: Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework
Abstract:
Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at https://github.com/mostly-ai/mostlyai-qa.
中文: 本研究提出一个综合框架,通过多维指标和基准测试策略量化合成数据的分布保真度与隐私保护效果,以评估其数据质量。
English: This study introduces a comprehensive framework for evaluating synthetic data quality by quantifying distributional fidelity and privacy protection through multi-dimensional metrics and benchmarking strategies.

Authors:Yanzhou Su, Tianbin Li, Jiyao Liu, Chenglong Ma, Junzhi Ning, Cheng Tang, Sibo Ju, Jin Ye, Pengcheng Chen, Ming Hu, Shixiang Tang, Lihao Liu, Bin Fu, Wenqi Shao, Xiaowei Hu, Xiangwen Liao, Yuanfeng Ji, Junjun He
Title: GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning
Abstract:
Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at https://github.com/uni-medical/GMAI-VL-R1.
中文: 本文提出GMAI-VL-R1,一种通过强化学习和新型推理数据合成方法增强的多模态医疗推理模型,显著提升了医疗任务中的诊断准确性和泛化能力。
English: This paper introduces GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning and a novel reasoning data synthesis method, which significantly improves diagnostic accuracy and generalization in medical tasks.

Authors:Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
Title: PaperBench: Evaluating AI's Ability to Replicate AI Research
Abstract:
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
Chinese: PaperBench是一个评估AI代理从零开始复现20篇ICML 2024顶尖论文能力的基准测试,通过分层量规和自动评估系统进行客观评测,目前最佳模型仅实现21%的复现完成度,尚未超越人类专家水平。
English: PaperBench is a benchmark that assesses AI agents' ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch, using hierarchical rubrics and an automated LLM-based judge; the best-performing tested agent achieves an average replication score of only 21.0%, and models do not yet outperform the human baseline.

Authors:Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
Title: LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
Abstract:
Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at https://github.com/hoorangyee/LRAGE.
中文: LRAGE是一个专注于法律领域的开源工具,用于全面评估检索增强生成系统,通过图形界面和命令行界面帮助用户分析五个关键组件对整体准确性的影响。
English: LRAGE is an open-source tool designed for the holistic evaluation of retrieval-augmented generation systems in the legal domain, enabling users to assess the impact of five key components on overall accuracy through both GUI and CLI interfaces.

Authors:Nusrat Munia, Abdullah-Al-Zubaer Imran
Title: Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Abstract:
Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision language models to generate accurate and proper prompts for each dermoscopic image which helps to generate synthetic images to improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnoses. Our extensive experimentation showcases the large vision language models providing much more insightful representations, that enable DermDiT to generate high-quality images. Our code is available at https://github.com/Munia03/DermDiT
中文:提出的DermDiT框架利用生成式AI和视觉语言模型创建合成皮肤镜图像,通过增强数据集中代表性不足群体的样本来解决皮肤疾病诊断中的偏差问题。
English: The proposed DermDiT framework uses generative AI and vision-language models to create synthetic dermoscopic images, addressing biases in skin disease diagnosis by improving representation of underrepresented groups in datasets.

Authors:Huayang Huang, Xiangye Jin, Jiaxu Miao, Yu Wu
Title: Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
Abstract:
The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore the significance of our approach. Code is available at https://github.com/Hannah1102/IBI-attacks.
中文: 本文提出文本到图像扩散模型中一种缺乏明确视觉特征但能在不同语义背景下多样化表现的新型隐性偏见,并设计了一种即插即用的攻击框架(IBI-Attacks),能在保持原始语义的同时注入难以察觉的偏见。
English: This paper introduces a novel implicit bias in text-to-image diffusion models that lacks explicit visual patterns but manifests diversely across semantic contexts, proposing a plug-and-play attack framework (IBI-Attacks) that injects subtle biases while preserving original image semantics.

Authors:Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
Title: SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
Abstract:
Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.
中文: SpaceR框架通过包含15.1万样本的数据集和具有地图想象机制的新型强化学习方法,有效提升了多模态大语言模型的视频空间推理能力,在空间推理基准测试中达到最优性能,同时保持卓越的视频理解表现。
English: The SpaceR framework addresses video spatial reasoning limitations in MLLMs by introducing a 151k-sample dataset and a novel reinforcement learning method with map imagination, achieving state-of-the-art performance on spatial reasoning benchmarks while maintaining competitive video understanding capabilities.

Authors:Neville K. Kitson, Anthony C. Constantinou
Title: Stable Structure Learning with HC-Stable and Tabu-Stable Algorithms
Abstract:
Many Bayesian Network structure learning algorithms are unstable, with the learned graph sensitive to arbitrary dataset artifacts, such as the ordering of columns (i.e., variable order). PC-Stable attempts to address this issue for the widely-used PC algorithm, prompting researchers to use the "stable" version instead. However, this problem seems to have been overlooked for score-based algorithms. In this study, we show that some widely-used score-based algorithms, as well as hybrid and constraint-based algorithms, including PC-Stable, suffer from the same issue. We propose a novel solution for score-based greedy hill-climbing that eliminates instability by determining a stable node order, leading to consistent results regardless of variable ordering. Two implementations, HC-Stable and Tabu-Stable, are introduced. Tabu-Stable achieves the highest BIC scores across all networks, and the highest accuracy for categorical networks. These results highlight the importance of addressing instability in structure learning and provide a robust and practical approach for future applications. This extends the scope and impact of our previous work presented at Probabilistic Graphical Models 2024 by incorporating continuous variables. The implementation, along with usage instructions, is freely available on GitHub at https://github.com/causal-iq/discovery.
Chinese: 本研究揭示了包括PC-Stable在内的常用评分型、混合型和约束型贝叶斯网络算法均存在变量顺序敏感性问题,提出了HC-Stable和Tabu-Stable解决方案,其中Tabu-Stable在所有网络中取得了最优的BIC分数和分类网络准确率。
English: This study reveals that widely-used score-based, hybrid, and constraint-based Bayesian Network algorithms, including PC-Stable, suffer from instability due to variable ordering and proposes HC-Stable and Tabu-Stable solutions, with Tabu-Stable achieving top performance metrics.

Authors:Kaan Karaman, Yuchang Jiang, Damien Robert, Vivien Sainte Fare Garnot, Maria João Santos, Jan Dirk Wegner
Title: GSR4B: Biomass Map Super-Resolution with Sentinel-1/2 Guidance
Abstract:
Accurate Above-Ground Biomass (AGB) mapping at both large scale and high spatio-temporal resolution is essential for applications ranging from climate modeling to biodiversity assessment, and sustainable supply chain monitoring. At present, fine-grained AGB mapping relies on costly airborne laser scanning acquisition campaigns usually limited to regional scales. Initiatives such as the ESA CCI map attempt to generate global biomass products from diverse spaceborne sensors but at a coarser resolution. To enable global, high-resolution (HR) mapping, several works propose to regress AGB from HR satellite observations such as ESA Sentinel-1/2 images. We propose a novel way to address HR AGB estimation, by leveraging both HR satellite observations and existing low-resolution (LR) biomass products. We cast this problem as Guided Super-Resolution (GSR), aiming at upsampling LR biomass maps (sources) from $100$ to $10$ m resolution, using auxiliary HR co-registered satellite images (guides). We compare super-resolving AGB maps with and without guidance, against direct regression from satellite images, on the public BioMassters dataset. We observe that Multi-Scale Guidance (MSG) outperforms direct regression both for regression ($-780$ t/ha RMSE) and perception ($+2.0$ dB PSNR) metrics, and better captures high-biomass values, without significant computational overhead. Interestingly, unlike the RGB+Depth setting they were originally designed for, our best-performing AGB GSR approaches are those that most preserve the guide image texture. Our results make a strong case for adopting the GSR framework for accurate HR biomass mapping at scale. Our code and model weights are made publicly available (https://github.com/kaankaramanofficial/GSR4B).
Chinese Summary: 该研究提出了一种引导式超分辨率方法,利用卫星图像将低分辨率生物量地图提升至高分辨率,相比直接回归方法在生物量估算精度上表现更优。
English Summary: The study introduces a Guided Super-Resolution method that enhances low-resolution biomass maps to high resolution using satellite imagery, demonstrating superior accuracy in biomass estimation compared to direct regression approaches.

Authors:Taehan Lee, Hyukjun Lee
Title: Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
Abstract:
Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. While effective in vision tasks by discarding non-object regions, applying this technique to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification models using Mel-spectrograms and analyzed the trade-offs between model performance and computational cost: TopK token pruning can reduce MAC operations of AudioMAE and AST by 30-40%, with less than a 1% drop in accuracy. Our analysis reveals that while high-intensity or high-variation tokens contribute significantly to model accuracy, low-intensity or low variation tokens also remain important when token pruning is applied; pruning solely based on the intensity or variation of signals in a patch leads to a noticeable drop in accuracy. We support our claim by measuring high correlation between attention scores and these statistical features and by showing retained tokens consistently receive distinct attention compared to pruned ones. We also show that AudioMAE retains more low-intensity tokens than AST. This can be explained by AudioMAE's self-supervised reconstruction objective, which encourages attention to all patches, whereas AST's supervised training focuses on label-relevant tokens.
中文:令牌剪枝可将基于ViT的音频模型计算成本降低30-40%且精度损失小于1%,但由于音频时频表征的特殊性,需综合关注度与统计特征来平衡令牌重要性,而非仅依赖信号强度或变化程度。
English: Token pruning reduces computational costs by 30-40% in ViT-based audio models with minimal accuracy loss, but requires balancing token importance beyond simple intensity or variation metrics due to audio-specific challenges.
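
Illustrative sketch (not from the paper): TopK token pruning, as named in the abstract, keeps only the highest-scoring tokens and discards the rest. The snippet below shows that selection step on plain lists. The `scores` values stand in for whatever importance measure a model provides (for example, attention-derived scores), and `keep_ratio` is a hypothetical parameter, not a setting reported by the authors.

```python
def topk_prune(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Truncate toward zero, but always keep at least one token.
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore positional order of the survivors
    return [tokens[i] for i in kept], kept


# Hypothetical spectrogram patches with made-up importance scores.
patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
scores = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
kept_patches, kept_idx = topk_prune(patches, scores, keep_ratio=0.5)
```

Sorting the surviving indices keeps the retained patches in their original time-frequency order, which matters when later layers rely on positional information.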

Authors:Bo-Kai Ruan, Yi-Zeng Fang, Hong-Han Shuai, Juinn-Dar Huang
Title: Anomaly Detection for Hybrid Butterfly Subspecies via Probability Filtering
Abstract:
Detecting butterfly hybrids requires knowledge of the parent subspecies, and the process can be tedious when encountering a new subspecies. This study focuses on a specific scenario where a model trained to recognize hybrids of species A can generalize to species B when B biologically mimics A. Since species A and B share similar patterns, we leverage BioCLIP as our feature extractor to capture features based on their taxonomy. Consequently, the algorithm designed for species A can be transferred to B, as their hybrid and non-hybrid patterns exhibit similar relationships. To determine whether a butterfly is a hybrid, we adopt our proposed probability filtering and color jittering to augment and simulate the mimicry. With these approaches, we achieve second place in the official development phase. Our code is publicly available at https://github.com/Justin900429/NSF-HDR-Challenge.
Chinese: 本研究证明,通过生物拟态,训练用于检测蝴蝶物种A杂交的模型可推广至物种B,利用BioCLIP进行特征提取,并结合概率过滤和颜色抖动技术,在挑战赛中荣获第二名。
English: This study demonstrates that a model trained to detect hybrids in butterfly species A can generalize to species B through biological mimicry, utilizing BioCLIP for feature extraction and achieving second place in the challenge with techniques like probability filtering and color jittering.

Authors:Yiting Lu, Xin Li, Haoning Wu, Bingchen Li, Weisi Lin, Zhibo Chen
Title: Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
Abstract:
The rapid advancement of Large Multi-modal Foundation Models (LMM) has paved the way for Explainable Image Quality Assessment (EIQA) with instruction tuning from two perspectives: overall quality explanation and attribute-wise perception answering. However, existing works usually overlook the conflicts between these two types of perception explanations during joint instruction tuning, leading to insufficient perception understanding. To mitigate this, we propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve synergy between these two EIQA tasks when adapting LMM, resulting in enhanced multi-faceted explanations of IQA. In particular, we propose a progressive instruction tuning strategy that divides the adaptation of LMM for EIQA into two stages: the first stage empowers the LMM with universal perception knowledge tailored for the two tasks using an efficient transfer learning strategy, i.e., LoRA, and the second stage introduces instruction-adaptive visual prompt tuning to dynamically adapt visual features to the different instructions from the two tasks. In this way, Q-Adapt yields a lightweight visual quality evaluator, demonstrating comparable and, in some instances, superior performance across perception-related benchmarks and commonly used IQA databases. The source code is publicly available at https://github.com/yeppp27/Q-Adapt.
中文:Q-Adapt方法通过渐进式指令调优解决可解释图像质量评估中整体与属性感知的冲突,在提升大模型多维度解释能力的同时,实现了轻量化且具有竞争力的评估效果。
English: The Q-Adapt method addresses conflicts between overall and attribute-wise explanations in Explainable Image Quality Assessment by using progressive instruction tuning to enhance LMM performance, achieving lightweight yet competitive results on benchmarks.

Authors:Changshuo Zhang, Zihan Lin, Shukai Liu, Yongqi Liu, Han Li
Title: Comment Staytime Prediction with LLM-enhanced Comment Understanding
Abstract:
In modern online streaming platforms, the comments section plays a critical role in enhancing the overall user experience. Understanding user behavior within the comments section is essential for comprehensive user interest modeling. A key factor of user engagement is staytime, which refers to the amount of time that users browse and post comments. Existing watchtime prediction methods struggle to adapt to staytime prediction, overlooking interactions with individual comments and their interrelation. In this paper, we present a micro-video recommendation dataset with video comments (named KuaiComt), collected from the Kuaishou platform. Correspondingly, we propose a practical framework for comment staytime prediction with LLM-enhanced Comment Understanding (LCU). Our framework leverages the strong text comprehension capabilities of large language models (LLMs) to understand textual information of comments, while also incorporating fine-grained comment ranking signals as auxiliary tasks. The framework is two-staged: first, the LLM is fine-tuned using domain-specific tasks to bridge the video and the comments; second, we incorporate the LLM outputs into the prediction model and design two comment ranking auxiliary tasks to better understand user preference. Extensive offline experiments demonstrate the effectiveness of our framework, showing significant improvements on the task of comment staytime prediction. Additionally, online A/B testing further validates the practical benefits in industrial scenarios. Our dataset KuaiComt (https://github.com/lyingCS/KuaiComt.github.io) and code for LCU (https://github.com/lyingCS/LCU) are fully released.
中文摘要:本文提出了一种利用大语言模型理解评论内容并结合细粒度排序信号的创新框架,用于预测用户在视频评论区的停留时间,通过线下实验和线上A/B测试验证了其显著效果。
English Summary: This paper introduces a novel framework for predicting user staytime in video comments by leveraging large language models to understand comment content and incorporating fine-grained ranking signals, demonstrating significant improvements through both offline experiments and online A/B testing.

Authors:Yuehui Qiu, Dandan Shan, Yining Wang, Pei Dong, Dijia Wu, Xinnian Yang, Qingqi Hong, Dinggang Shen
Title: A topology-preserving three-stage framework for fully-connected coronary artery extraction
Abstract:
Coronary artery extraction is a crucial prerequisite for computer-aided diagnosis of coronary artery disease. Accurately extracting the complete coronary tree remains challenging due to several factors, including presence of thin distal vessels, tortuous topological structures, and insufficient contrast. These issues often result in over-segmentation and under-segmentation in current segmentation methods. To address these challenges, we propose a topology-preserving three-stage framework for fully-connected coronary artery extraction. This framework includes vessel segmentation, centerline reconnection, and missing vessel reconstruction. First, we introduce a new centerline enhanced loss in the segmentation process. Second, for the broken vessel segments, we further propose a regularized walk algorithm to integrate distance, probabilities predicted by a centerline classifier, and directional cosine similarity, for reconnecting the centerlines. Third, we apply implicit neural representation and implicit modeling, to reconstruct the geometric model of the missing vessels. Experimental results show that our proposed framework outperforms existing methods, achieving Dice scores of 88.53\% and 85.07\%, with Hausdorff Distances (HD) of 1.07mm and 1.63mm on ASOCA and PDSCA datasets, respectively. Code will be available at https://github.com/YH-Qiu/CorSegRec.
Chinese: 本文提出了一种三阶段框架用于全连接冠状动脉提取,通过血管分割、中心线重连和缺失血管重建解决分割难题,在基准数据集上分别达到88.53%和85.07%的Dice分数,性能优于现有方法。
English: This paper presents a three-stage framework for fully-connected coronary artery extraction that addresses segmentation challenges through vessel segmentation, centerline reconnection, and missing vessel reconstruction, achieving superior performance with Dice scores of 88.53% and 85.07% on benchmark datasets.
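
Illustrative sketch (not from the paper): the Dice score reported in the abstract measures the overlap between a predicted and a reference segmentation, Dice = 2|P ∩ T| / (|P| + |T|). The snippet below computes it for flattened binary masks. The masks are made-up toy data, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks (1 = vessel voxel, 0 = background).
pred_mask = [1, 1, 0, 1, 0, 0]
true_mask = [1, 0, 0, 1, 1, 0]
score = dice_score(pred_mask, true_mask)
```

Here two voxels overlap and each mask contains three vessel voxels, so the score is 2·2 / (3 + 3) = 2/3.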

Authors:Jijun Xiang, Xuan Zhu, Xianqi Wang, Yu Wang, Hong Zhang, Fei Guo, Xin Yang
Title: DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Abstract:
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our code is available at https://github.com/ShadowBbBb/Depthor
Chinese: 提出的DEPTHOR方法通过模拟真实世界dToF数据进行鲁棒训练,并结合单目深度估计来增强深度补全,在基准数据集上以显著提升的精度指标实现了最先进的性能。
English: The proposed DEPTHOR method enhances depth completion by simulating real-world dToF data for robust training and integrating monocular depth estimation, achieving state-of-the-art performance with significant improvements in accuracy metrics on benchmark datasets.

Authors:Xinyi Li, Shenghai Yuan, Haoxin Cai, Shunan Lu, Wenhua Wang, Jianqi Liu
Title: LL-Localizer: A Life-Long Localization System based on Dynamic i-Octree
Abstract:
This paper proposes an incremental voxel-based life-long localization method, LL-Localizer, which enables robots to localize robustly and accurately in multi-session mode using prior maps. Because it is difficult to be aware of environmental changes relative to the prior map, and robots may traverse between mapped and unmapped areas during actual operation, we update the map when needed, according to established strategies, through an incremental voxel map. In addition, to ensure real-time performance and facilitate map management, we utilize Dynamic i-Octree, an efficient organization of 3D points based on Dynamic Octree, to load the local map and update the map during the robot's operation. Experiments show that our system performs stable and accurate localization comparable to state-of-the-art LIO systems. Even if the environment in the prior map changes or the robot traverses between mapped and unmapped areas, our system maintains robust and accurate localization without any distinction. Our demo can be found on Bilibili (https://www.bilibili.com/video/BV1faZHYCEkZ) and YouTube (https://youtu.be/UWn7RCb9kA8), and the program will be available at https://github.com/M-Evanovic/LL-Localizer.
中文: 本文提出LL-Localizer,一种基于增量体素的终身定位方法,通过动态i-Octree结构实现地图实时更新,使机器人在环境变化或跨越已测绘/未测绘区域时仍能保持稳定精准的定位。
English: The paper introduces LL-Localizer, an incremental voxel-based lifelong localization method that enables robust and accurate robot localization in changing environments by dynamically updating maps using a Dynamic i-Octree structure.

Authors:Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong
Title: STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation
Abstract:
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at https://github.com/HUANGLIZI/STPNet.
中文: STPNet是一种尺度感知文本提示网络,通过多尺度文本描述和检索-分割联合学习增强医学图像分割效果,在无需推理阶段文本输入的情况下超越了现有最优方法。
English: STPNet is a scale-aware text prompt network that enhances medical image segmentation by leveraging multi-scale textual descriptions and retrieval-segmentation joint learning, outperforming state-of-the-art methods without requiring text input during inference.

Authors:Luca Ciampi, Gabriele Lagani, Giuseppe Amato, Fabrizio Falchi
Title: Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Abstract:
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation
中文: 本文提出了一种基于去噪扩散模型的新型半监督师生框架,通过迭代伪标签优化有效利用有限标注数据进行生物医学图像分割,在多个基准测试中超越了现有最优方法。
English: This paper introduces a novel semi-supervised teacher-student framework using denoising diffusion models for biomedical image segmentation, which outperforms existing methods by effectively leveraging limited labeled data through iterative pseudo-label refinement.

Authors:Zixuan Wang, Duo Peng, Feng Chen, Yuwei Yang, Yinjie Lei
Title: Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
Abstract:
Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.
中文摘要:本文提出了一种模块化条件图像合成框架,将条件划分为文本、布局和拖拽三个基本单元,通过专门设计的对齐模块增强模型对多样化生成任务的适应性和应用范围。
English Summary: This paper introduces a modular framework for conditional image synthesis that divides conditions into text, layout, and drag units, with specialized alignment modules for each to enhance adaptability and application scope across diverse generation tasks.

Authors:Yongjun He, Roger Waleffe, Zhichao Han, Johnu George, Binhang Yuan, Zitao Zhang, Yinan Shan, Yang Zhao, Debojyoti Dutta, Theodoros Rekatsinas, Ce Zhang
Title: MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
Abstract:
Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.
中文: MLKV是一个高效、可扩展且可复用的数据存储框架,通过普及优化策略解决嵌入模型训练中的数据停滞和过时问题,其性能比工业级键值存储高出1.6至12.6倍。
English: MLKV is a scalable and reusable data storage framework that addresses data stall and staleness in embedding model training by democratizing optimizations and outperforming existing key-value stores by 1.6-12.6 times.

Authors:Soumyya Kanti Datta, Shan Jia, Siwei Lyu
Title: Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
Abstract:
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .
Chinese: 本文提出LIPINC-V2检测框架,通过识别嘴部区域的时空不一致性来有效检测细微的唇语同步深度伪造,在基准数据集上实现了最先进的检测性能。
English: This paper introduces LIPINC-V2, a novel detection framework that identifies spatiotemporal inconsistencies in the mouth region to effectively detect subtle lip-syncing deepfakes, achieving state-of-the-art performance on benchmark datasets.

Authors:Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O'Brien, Kevin Zhu
Title: FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations
Abstract:
In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.
中文: FAIRE基准测试揭示了用于简历评估的AI模型存在不同程度的种族和性别偏见,强调了在AI驱动招聘中减少偏见的紧迫性。
English: The FAIRE benchmark reveals varying degrees of racial and gender bias in AI models used for resume evaluation, underscoring the need for bias mitigation in AI-driven hiring.
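
Illustrative sketch (not from the paper): the direct-scoring idea in the abstract compares the score a model gives a resume with the score it gives the same resume after the identity cue is altered. One simple way to summarize this is the mean per-resume score change, shown below. The function name and all scores are hypothetical, and the benchmark's actual metrics may differ.

```python
from statistics import mean


def mean_score_gap(base_scores, altered_scores):
    """Average per-resume change in score after the identity cue is altered."""
    if len(base_scores) != len(altered_scores):
        raise ValueError("each base resume needs exactly one altered variant")
    return mean(a - b for b, a in zip(base_scores, altered_scores))


# Made-up model scores for three resumes before and after the alteration.
base = [7.0, 8.0, 6.0]
altered = [6.0, 8.0, 5.0]
gap = mean_score_gap(base, altered)
```

A gap near zero suggests the alteration did not shift scores on average, while the sign indicates the direction of any shift.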

Authors:Korbinian Moller, Truls Nyberg, Jana Tumova, Johannes Betz
Title: Pedestrian-Aware Motion Planning for Autonomous Driving in Complex Urban Scenarios
Abstract:
Motion planning in uncertain environments like complex urban areas is a key challenge for autonomous vehicles (AVs). The aim of our research is to investigate how AVs can navigate crowded, unpredictable scenarios with multiple pedestrians while maintaining a safe and efficient vehicle behavior. So far, most research has concentrated on static or deterministic traffic participant behavior. This paper introduces a novel algorithm for motion planning in crowded spaces by combining social force principles for simulating realistic pedestrian behavior with a risk-aware motion planner. We evaluate this new algorithm in a 2D simulation environment to rigorously assess AV-pedestrian interactions, demonstrating that our algorithm enables safe, efficient, and adaptive motion planning, particularly in highly crowded urban environments - a first in achieving this level of performance. This study has not taken into consideration real-time constraints and has been shown only in simulation so far. Further studies are needed to investigate the novel algorithm in a complete software stack for AVs on real cars to investigate the entire perception, planning and control pipeline in crowded scenarios. We release the code developed in this research as an open-source resource for further studies and development. It can be accessed at the following link: https://github.com/TUM-AVS/PedestrianAwareMotionPlanning
中文摘要:本研究提出一种创新算法,将社会力原理与风险感知运动规划相结合,使自动驾驶汽车能够在拥挤城市环境中实现安全高效的导航,并通过二维仿真验证了其性能。
English Summary: This research introduces a novel algorithm combining social force principles with risk-aware motion planning to enable autonomous vehicles to navigate crowded urban environments safely and efficiently, as demonstrated through 2D simulations.
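
Illustrative sketch (not from the paper): social force models describe a pedestrian's motion as the sum of a driving force toward a goal and repulsive forces from nearby agents. The snippet below is a generic, minimal version of that idea with an exponentially decaying repulsion. All parameter values, function names, and the Euler integration step are assumptions chosen for illustration, not the authors' formulation.

```python
import math


def repulsive_force(pos, other, strength=2.0, range_m=0.5):
    """Exponentially decaying push away from another agent."""
    dx, dy = pos[0] - other[0], pos[1] - other[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)  # direction undefined when positions coincide
    magnitude = strength * math.exp(-dist / range_m)
    return (magnitude * dx / dist, magnitude * dy / dist)


def driving_force(vel, goal_dir, desired_speed=1.3, relax_time=0.5):
    """Pull the current velocity toward the desired velocity."""
    return ((desired_speed * goal_dir[0] - vel[0]) / relax_time,
            (desired_speed * goal_dir[1] - vel[1]) / relax_time)


def step(pos, vel, goal_dir, others, dt=0.1):
    """Advance one pedestrian by one Euler step (unit mass assumed)."""
    fx, fy = driving_force(vel, goal_dir)
    for other in others:
        rx, ry = repulsive_force(pos, other)
        fx, fy = fx + rx, fy + ry
    new_vel = (vel[0] + fx * dt, vel[1] + fy * dt)
    new_pos = (pos[0] + new_vel[0] * dt, pos[1] + new_vel[1] * dt)
    return new_pos, new_vel


# A pedestrian at rest heading in +x with another agent 1 m directly ahead.
new_pos, new_vel = step((0.0, 0.0), (0.0, 0.0), (1.0, 0.0), [(1.0, 0.0)])
```

The agent ahead slows the pedestrian's acceleration toward the goal without reversing it, which is the kind of interaction a planner would need to anticipate.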

Authors:Korbinian Moller, Luis Schwarzmeier, Johannes Betz
Title: From Shadows to Safety: Occlusion Tracking and Risk Mitigation for Urban Autonomous Driving
Abstract:
Autonomous vehicles (AVs) must navigate dynamic urban environments where occlusions and perception limitations introduce significant uncertainties. This research builds upon and extends existing approaches in risk-aware motion planning and occlusion tracking to address these challenges. While prior studies have developed individual methods for occlusion tracking and risk assessment, a comprehensive method integrating these techniques has not been fully explored. We, therefore, enhance a phantom agent-centric model by incorporating sequential reasoning to track occluded areas and predict potential hazards. Our model enables realistic scenario representation and context-aware risk evaluation by modeling diverse phantom agents, each with distinct behavior profiles. Simulations demonstrate that the proposed approach improves situational awareness and balances proactive safety with efficient traffic flow. While these results underline the potential of our method, validation in real-world scenarios is necessary to confirm its feasibility and generalizability. By utilizing and advancing established methodologies, this work contributes to safer and more reliable AV planning in complex urban environments. To support further research, our method is available as open-source software at: https://github.com/TUM-AVS/OcclusionAwareMotionPlanning
中文: 本研究通过将顺序推理融入虚拟智能体模型来增强自动驾驶车辆的安全性,改进了遮挡追踪和风险评估,仿真实验表明该方法能提升环境感知能力并平衡交通流效率。
English: This research enhances autonomous vehicle safety by integrating sequential reasoning into phantom agent modeling for improved occlusion tracking and risk assessment, with simulations showing better situational awareness and traffic flow balance.

Authors:Kecen Li, Chen Gong, Xiaochen Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
Title: From Easy to Hard: Building a Shortcut for Differentially Private Image Synthesis
Abstract:
Differentially private (DP) image synthesis aims to generate synthetic images from a sensitive dataset, alleviating the privacy leakage concerns of organizations sharing and utilizing synthetic images. Although previous methods have made significant progress, especially in training diffusion models on sensitive images with DP Stochastic Gradient Descent (DP-SGD), they still suffer from unsatisfactory performance. In this work, inspired by curriculum learning, we propose a two-stage DP image synthesis framework, where diffusion models learn to generate DP synthetic images from easy to hard. Unlike existing methods that directly use DP-SGD to train diffusion models, we propose an easy stage in the beginning, where diffusion models learn simple features of the sensitive images. To facilitate this easy stage, we propose to use "central images", which are simply aggregations of random samples of the sensitive dataset. Intuitively, although those central images do not show details, they demonstrate useful characteristics of all images and only incur minimal privacy costs, thus helping early-phase model training. Experiments show that, averaged over four image datasets, the fidelity and utility metrics of our synthetic images are 33.1% and 2.1% better, respectively, than those of the state-of-the-art method.
中文: 本文提出了一种受课程学习启发的两阶段差分隐私图像合成框架,通过使用中心图像让扩散模型从易到难学习,在保真度和实用性上相比现有方法取得了显著提升。
English: This paper introduces a two-stage differentially private image synthesis framework inspired by curriculum learning, which uses central images to train diffusion models from easy to hard, achieving significant improvements in fidelity and utility over existing methods.
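
The "central images" idea in this abstract (aggregations of random samples from the sensitive dataset) can be sketched as follows. This is a minimal illustration under stated assumptions: images are flat lists of pixel values in [0, 1], the aggregation is a per-pixel mean, and Gaussian noise stands in for the DP mechanism. The function name, the noise scale, and the toy data are hypothetical, and calibrating the noise to a privacy budget is out of scope.

```python
import random

def central_image(images, sample_size, noise_std, rng):
    """Aggregate a random sample of images into one noisy mean image.

    images: list of equal-length pixel lists with values in [0, 1].
    noise_std: std of Gaussian noise added per pixel (assumed mechanism).
    """
    batch = rng.sample(images, sample_size)
    n_pixels = len(batch[0])
    central = []
    for i in range(n_pixels):
        mean = sum(img[i] for img in batch) / sample_size
        noisy = mean + rng.gauss(0.0, noise_std)
        central.append(min(1.0, max(0.0, noisy)))  # clamp to valid range
    return central

rng = random.Random(0)
# Toy "dataset": 100 images of 16 pixels each.
dataset = [[rng.random() for _ in range(16)] for _ in range(100)]
centers = [central_image(dataset, 20, 0.05, rng) for _ in range(5)]
```

In the two-stage framework described above, images like `centers` would serve as the easy-stage training data before DP-SGD training on the sensitive images themselves.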

Authors:Chang-Bin Zhang, Jinhong Ni, Yujie Zhong, Kai Han
Title: v-CLR: View-Consistent Learning for Open-World Instance Segmentation
Abstract:
In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: https://visual-ai.github.io/vclr
中文: 本文提出视图一致性学习(v-CLR)框架,通过跨视图特征一致性训练模型学习外观不变的对象表示,在开放世界实例分割任务中取得领先性能。
English: This paper introduces view-Consistent LeaRning (v-CLR), a framework that enhances open-world instance segmentation by training models to recognize appearance-invariant object representations through cross-view feature consistency, achieving state-of-the-art results on benchmarks.

Authors:Zhe Jiang, Sam Ainsworth, Timothy Jones
Title: FireGuard: A Generalized Microarchitecture for Fine-Grained Security Analysis on OoO Superscalar Cores
Abstract:
High-performance security guarantees rely on hardware support. Generic programmable support for fine-grained instruction analysis has gained broad interest in the literature as a fundamental building block for the security of future processors. Yet, implementation in real out-of-order (OoO) superscalar processors presents tough challenges that cannot be explored in highly abstract simulators. We detail the challenges of implementing complex programmable pathways without critical paths or contention. We then introduce FireGuard, the first implementation of fine-grained instruction analysis on a real OoO superscalar processor. We establish an end-to-end system, including microarchitecture, SoC, ISA and programming model. Experiments show that our solution simultaneously ensures both security and performance of the system, with parallel scalability. We examine the feasibility of building FireGuard into modern SoCs: Apple's M1-Pro, Huawei's Kirin-960, and Intel's i7-12700F, where it adds less than 1% silicon area. FireGuard's source code is available at https://github.com/SEU-ACAL/reproduce-FireGuard-DAC-25.
中文摘要:高性能安全保障依赖硬件支持,FireGuard首次在真实乱序超标量处理器上实现了细粒度指令分析,在保证安全与性能的同时,硅片面积增加不足1%。
English Summary: High-performance security requires hardware support, and FireGuard is the first implementation of fine-grained instruction analysis on a real out-of-order superscalar processor, ensuring both security and performance with minimal silicon area increase.

Authors:Lin Zhang, Zhouhong Gu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: LITE: LLM-Impelled efficient Taxonomy Evaluation
Abstract:
This paper presents LITE, an LLM-based evaluation method designed for efficient and flexible assessment of taxonomy quality. To address challenges in large-scale taxonomy evaluation, such as efficiency, fairness, and consistency, LITE adopts a top-down hierarchical evaluation strategy, breaking down the taxonomy into manageable substructures and ensuring result reliability through cross-validation and standardized input formats. LITE also introduces a penalty mechanism to handle extreme cases and provides both quantitative performance analysis and qualitative insights by integrating evaluation metrics closely aligned with task objectives. Experimental results show that LITE demonstrates high reliability in complex evaluation tasks, effectively identifying semantic errors, logical contradictions, and structural flaws in taxonomies, while offering directions for improvement. Code is available at https://github.com/Zhang-l-i-n/TAXONOMY_DETECT.
中文: 本文提出LITE,一种基于大语言模型的评估方法,通过分层策略、交叉验证和惩罚机制高效评估分类体系质量,在识别语义错误和结构缺陷方面展现出高可靠性。
English: This paper introduces LITE, an LLM-based evaluation method that efficiently assesses taxonomy quality through a hierarchical strategy, cross-validation, and penalty mechanisms, demonstrating high reliability in identifying errors and structural flaws.

Authors:Khoa A. Tran, John V. Pearson, Nicola Waddell
Title: xML-workFlow: an end-to-end explainable scikit-learn workflow for rapid biomedical experimentation
Abstract:
Motivation: Building and iterating machine learning models is often a resource-intensive process. In biomedical research, scientific codebases can lack scalability and are not easily transferable to work beyond what they were intended. xML-workFlow addresses this issue by providing a rapid, robust, and traceable end-to-end workflow that can be adapted to any ML project with minimal code rewriting. Results: We show a practical, end-to-end workflow that integrates scikit-learn, MLflow, and SHAP. This template significantly reduces the time and effort required to build and iterate on ML models, addressing the common challenges of scalability and reproducibility in biomedical research. Adapting our template may save bioinformaticians time in development and enables biomedical researchers to deploy ML projects. Availability and implementation: xML-workFlow is available at https://github.com/MedicalGenomicsLab/xML-workFlow.
中文摘要:xML-workFlow 提供了一种可扩展的端到端机器学习解决方案,通过整合 scikit-learn、MLflow 和 SHAP 工具并最小化代码修改,有效缩短生物医学研究中的开发时间并提升结果可复现性。
English Summary: xML-workFlow provides a scalable end-to-end machine learning solution that reduces development time and improves reproducibility in biomedical research by integrating scikit-learn, MLflow, and SHAP with minimal code adaptation.

Authors:Zhe Jiang, Minli Liao, Sam Ainsworth, Dean You, Timothy Jones
Title: MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors
Abstract:
Heterogeneous parallel error detection is an approach to achieving fault-tolerant processors, leveraging multiple power-efficient cores to re-execute software originally run on a high-performance core. Yet, its complex components, gathering data cross-chip from many parts of the core, raise questions of how to build it into commodity cores without heavy design invasion and extensive re-engineering. We build the first full-RTL design, MEEK, into an open-source SoC, from microarchitecture and ISA to the OS and programming model. We identify and solve bottlenecks and bugs overlooked in previous work, and demonstrate that MEEK offers microsecond-level detection capacity with affordable overheads. By trading off architectural functionalities across codesigned hardware-software layers, MEEK features only light changes to a mature out-of-order superscalar core, simple coordinating software layers, and a few lines of operating-system code. MEEK's source code is available at https://github.com/SEU-ACAL/reproduce-MEEK-DAC-25.
中文: MEEK是一种集成到开源SoC中的全RTL异构并行错误检测系统,通过对成熟处理器核心进行少量改动和轻量级软件协调,实现了微秒级故障检测能力。
English: MEEK is a full-RTL heterogeneous parallel error detection system integrated into an open-source SoC, providing microsecond-level fault detection with minimal design changes to a mature processor core and lightweight software coordination.

Authors:Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He
Title: RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking
Abstract:
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating them with an external knowledge base to improve answer relevance and accuracy. In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables, and user questions often require retrieving answers that are distributed across multiple tables. Retrieving knowledge from table corpora (i.e., various individual tables) for a question remains a nascent problem, at least with respect to (i) how to understand intra- and inter-table knowledge effectively, (ii) how to filter out unnecessary tables and retrieve the most relevant tables efficiently, (iii) how to prompt LLMs to infer over the retrieved tables, and (iv) how to evaluate the corresponding performance in a realistic setting. To address these challenges, we first propose a table-corpora-aware RAG framework, named T-RAG, which consists of a hierarchical memory index, multi-stage retrieval, and graph-aware prompting for effective and efficient table knowledge retrieval and inference. We then develop a multi-table question answering benchmark named MultiTableQA, which spans 3 different task types, 57,193 tables, and 23,758 questions in total, all drawn from real-world scenarios. Based on MultiTableQA, we conduct a holistic comparison of table retrieval methods, RAG methods, and table-to-graph representation learning methods, in which T-RAG shows the leading accuracy, recall, and running time. We also evaluate how T-RAG improves the inference ability of different LLMs. Code and data are available at https://github.com/jiaruzouu/T-RAG
中文: T-RAG是一种新颖的框架,通过从多个表格中有效检索和推理知识来增强检索增强生成,并在新开发的MultiTableQA基准测试中展现出卓越的准确性、召回率和效率性能。
English: T-RAG is a novel framework that enhances retrieval-augmented generation by effectively retrieving and inferring knowledge from multiple tables, demonstrating superior performance in accuracy, recall, and efficiency on the newly developed MultiTableQA benchmark.

Authors:Chunhui Zhang, Li Liu, Jialin Gao, Xin Sun, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang
Title: COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Abstract:
Transformers have recently demonstrated great potential in improving vision-language (VL) tracking algorithms. However, most of the existing VL trackers rely on carefully designed mechanisms to perform multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores the distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking, aiming to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, empirically demonstrating that a simple stack of transformer encoders effectively enables unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark dataset for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. Our dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging due to weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the usage of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. The source code and dataset will be made publicly available.
中文: 提出的COST框架采用对比式单阶段Transformer,通过最大化视频与语言间的互信息实现跨模态对齐,在包括新型小目标追踪数据集VL-SOT500在内的多个视觉语言追踪基准上取得了最优性能。
English: The proposed COST framework introduces a contrastive one-stage transformer that aligns video and language features through mutual information maximization, achieving state-of-the-art performance on multiple VL tracking benchmarks including the newly introduced VL-SOT500 dataset for small object tracking.
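
The contrastive alignment described in this abstract can be illustrated with an InfoNCE-style loss, whose minimization maximizes a lower bound on the mutual information between paired embeddings. The abstract does not state the exact objective COST uses, so the choice of InfoNCE, the video-to-text direction, the temperature value, and all names below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(video_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss, treating (video_i, text_i) as the positive pair."""
    total = 0.0
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(video_embs)

videos = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each video embedding is closest to its own description and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to encourage.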

Authors:Jiawei Wang, Yushen Zuo, Yuanjun Chai, Zhendong Liu, Yicheng Fu, Yichun Feng, Kin-Man Lam
Title: Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Abstract:
Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving functionality of VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
中文摘要:视觉语言模型易受噪声图像攻击,而提出的Robust-VLGuard通过噪声增强微调和DiffPure-VLM防御方法,在保持模型功能的同时显著提升了对抗扰动的防御能力。
English Summary: Vision-Language Models remain vulnerable to jailbreak attacks through noisy images, but the proposed Robust-VLGuard with noise-augmented fine-tuning and DiffPure-VLM defense effectively mitigates these security risks while maintaining model functionality.

Authors:Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang
Title: ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
Abstract:
We present ThinkPrune, a simple yet effective method for pruning the thinking length of long-thinking LLMs, which have been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to exit early, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly stringent token limit. We observed that ThinkPrune results in a remarkable performance-length tradeoff -- on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only a 2% drop in performance. We also observed that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at https://github.com/UCSB-NLP-Chang/ThinkPrune.
中文总结:ThinkPrune通过强化学习对长思维大模型进行迭代式思维剪枝,在AIME24数据集上实现了推理长度减半而性能仅下降2%的显著效果。
English Summary: ThinkPrune is a reinforcement learning-based method that optimizes long-thinking LLMs by iteratively pruning redundant reasoning steps, achieving a 50% reduction in reasoning length with minimal performance loss on the AIME24 dataset.
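
The token-limit reward described in this abstract can be sketched in a few lines. The zero reward for outputs that exceed the limit follows the abstract; the binary correctness reward, the function name, and the limit schedule are assumptions for illustration and are not taken from the ThinkPrune code.

```python
def length_clipped_reward(response_tokens, is_correct, token_limit):
    """Return 1.0 only for a correct answer that finishes within the limit."""
    if len(response_tokens) > token_limit:
        return 0.0  # unfinished output is discarded, so the reward is zero
    return 1.0 if is_correct else 0.0

# Iterative pruning: each RL round uses a stricter limit (hypothetical values).
token_limits = [4000, 3000, 2000]

response = ["tok"] * 2500  # stand-in for a 2500-token response
rewards_per_round = [
    length_clipped_reward(response, True, limit) for limit in token_limits
]
```

In this toy schedule the same correct 2500-token response is rewarded in the first two rounds but not in the third, which is the pressure that pushes the model toward shorter reasoning.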

Authors:Salim Khazem, Jeremy Fix, Cédric Pradalier
Title: PolygoNet: Leveraging Simplified Polygonal Representation for Effective Image Classification
Abstract:
Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments in the paper is provided at https://github.com/salimkhazem/PolygoNet.
Chinese: 本文提出一种高效的深度学习方法,通过采用多边形图像表示来降低计算复杂度、防止过拟合,并能在保持与先进方法相当性能的同时,实现在边缘设备上的部署。
English: This paper introduces an efficient deep learning method that uses polygonal image representations to reduce computational complexity, prevent overfitting, and enable deployment on edge devices while maintaining performance comparable to state-of-the-art approaches.

Authors:Jose Gallego-Posada, Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien
Title: Cooper: A Library for Constrained Optimization in Deep Learning
Abstract:
Cooper is an open-source package for solving constrained optimization problems involving deep learning models. Cooper implements several Lagrangian-based first-order update schemes, making it easy to combine constrained optimization algorithms with high-level features of PyTorch such as automatic differentiation, and specialized deep learning architectures and optimizers. Although Cooper is specifically designed for deep learning applications where gradients are estimated based on mini-batches, it is suitable for general non-convex continuous constrained optimization. Cooper's source code is available at https://github.com/cooper-org/cooper.
中文: Cooper是一个用于深度学习约束优化的开源工具包,它将拉格朗日方法与PyTorch的自动微分和专业架构等特性相结合。
English: Cooper is an open-source toolkit for constrained optimization in deep learning, integrating Lagrangian methods with PyTorch's features like automatic differentiation and specialized architectures.
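
The Lagrangian-based first-order update scheme mentioned in this abstract can be illustrated on a toy problem with simultaneous gradient descent on the primal variable and projected gradient ascent on the multiplier. The sketch below deliberately does not use Cooper's API; the problem, step size, and function name are assumptions chosen so that the exact solution (x = 1, multiplier = 2) is known.

```python
def solve(lr=0.05, steps=2000):
    """Minimize f(x) = x**2 subject to g(x) = 1 - x <= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * x - lam   # d/dx of L(x, lam) = x**2 + lam * (1 - x)
        violation = 1.0 - x      # d/dlam of L, i.e. the constraint value g(x)
        # Descent on x, ascent on lam, with lam projected onto lam >= 0.
        x, lam = x - lr * grad_x, max(0.0, lam + lr * violation)
    return x, lam

x_opt, lam_opt = solve()
```

The multiplier grows while the constraint is violated and stops growing once it is satisfied, which is why the iterates settle at the constrained optimum rather than at the unconstrained minimum x = 0.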

Authors:Xin Hong, Aochu Dai, Dingchao Gao, Sanjiang Li, Zhengfeng Ji, Mingsheng Ying
Title: LimTDD: A Compact Decision Diagram Integrating Tensor and Local Invertible Map Representations
Abstract:
Tensor networks serve as a powerful tool for efficiently representing and manipulating high-dimensional data in applications such as quantum physics, machine learning, and data compression. Tensor Decision Diagrams (TDDs) offer an efficient framework for tensor representation by leveraging decision diagram techniques. However, current implementations of TDDs and other decision diagrams fail to exploit tensor isomorphisms, limiting their compression potential. This paper introduces Local Invertible Map Tensor Decision Diagrams (LimTDDs), an extension of TDDs that incorporates local invertible maps (LIMs) to achieve more compact representations. Unlike LIMDD, which uses Pauli operators for quantum states, LimTDD employs the $XP$-stabilizer group, enabling broader applicability across tensor-based tasks. We present efficient algorithms for normalization, slicing, addition, and contraction, which are critical for tensor network applications. Theoretical analysis demonstrates that LimTDDs achieve greater compactness than TDDs and, in best-case scenarios and for quantum state representations, offer exponential compression advantages over both TDDs and LIMDDs. Experimental results in quantum circuit tensor computation and simulation confirm LimTDD's superior efficiency. Open-source code is available at https://github.com/Veriqc/LimTDD.
中文:LimTDD通过引入基于XP稳定子群的局部可逆映射扩展了张量决策图,在量子与通用计算任务中实现了更优越的张量网络压缩性能和计算效率。
English: LimTDDs extend Tensor Decision Diagrams by incorporating local invertible maps using the XP-stabilizer group, achieving superior compression and efficiency in tensor network applications across quantum and general computational tasks.

Authors:Gregory M. Campbell, Gentian Muhaxheri, Leonardo Ferreira Guilhoto, Christian D. Santangelo, Paris Perdikaris, James Pikul, Mark Yim
Title: Active Learning Design: Modeling Force Output for Axisymmetric Soft Pneumatic Actuators
Abstract:
Soft pneumatic actuators (SPAs) made from elastomeric materials can provide large strain and large force. The behavior of locally strain-restricted hyperelastic materials under inflation has been investigated thoroughly for shape reconfiguration, but requires further investigation for trajectories involving external force. In this work, we model force-pressure-height relationships for a concentrically strain-limited class of soft pneumatic actuators and demonstrate the use of this model to design SPA response for object lifting. We predict relationships under different loadings by solving energy minimization equations and verify this theory by using an automated test rig to collect rich data for n=22 Ecoflex 00-30 membranes. We collect this data using an active learning pipeline to efficiently model the design space. We show that this learned material model outperforms the theory-based model and naive curve-fitting approaches. We use our model to optimize membrane design for different lift tasks and compare this performance to other designs. These contributions represent a step towards understanding the natural response for this class of actuator and embodying intelligent lifts in a single-pressure input actuator system.
中文: 本研究针对同心应变限制型软气动执行器进行建模与优化,证明了在预测力-压力关系和提升执行器性能方面,学习得到的材料模型优于理论模型及简单曲线拟合方法。
English: This study models and optimizes concentrically strain-limited soft pneumatic actuators for object lifting, demonstrating that a learned material model outperforms theoretical and curve-fitting approaches in predicting force-pressure relationships and enhancing actuator design.

Authors:Ilir Tahiraj, Jeremialie Swadiryus, Felix Fent, Markus Lienkamp
Title: Cal or No Cal? -- Real-Time Miscalibration Detection of LiDAR and Camera Sensors
Abstract:
The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle's safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. Therefore, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state-of-the-art in terms of detection performance, inference time, and resource demand. The code is open source and available on https://github.com/TUMFTM/MiscalibrationDetection.
Chinese: 本文提出了一种基于对比学习的失准检测框架,通过将传感器校准状态分类为已校准或未校准,在检测性能、速度和资源效率上均优于现有方法。
English: The paper introduces a miscalibration detection framework using contrastive learning to classify sensor calibration states as calibrated or miscalibrated, outperforming current methods in detection performance, speed, and efficiency.

Authors:Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong
Title: An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection
Abstract:
Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed and accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git.
中文: 该研究提出了一种结合OCT-X算法与先进硬件的集成系统,实现了99.70%的胃癌早期诊断准确率,显著优于现有方法。
English: The study introduces an integrated system combining the OCT-X algorithm with advanced hardware to achieve 99.70% diagnostic accuracy for early gastric cancer detection, significantly outperforming existing methods.

Authors:Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan
Title: AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Abstract:
Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules: players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employed large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Code and checkpoints are available at https://github.com/TencentARC/AnimeGamer.
中文: 生成式游戏的最新进展允许玩家通过语言指令与动漫世界互动,而提出的AnimeGamer模型通过多模态表示生成动态且上下文一致的动画,解决了现有方法的一致性和静态画面限制问题。
English: Recent advances in generative games enable players to interact with anime worlds through language instructions, and the proposed AnimeGamer model addresses inconsistencies and static limitations by using multimodal representations to generate dynamic, contextually consistent animations.

Authors:Saarthak Kapse, Pushpak Pati, Srikar Yellapragada, Srijan Das, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna
Title: GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Abstract:
Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
中文摘要:GECKO提出了一种自监督预训练方法,通过将全切片图像嵌入与可解释的概念先验对齐,在多项任务中超越现有方法,同时提供具有临床意义的病理学解释。
English Summary: GECKO introduces a self-supervised pretraining method that aligns whole slide image embeddings with interpretable concept priors, outperforming existing approaches across multiple tasks while providing clinically meaningful insights.
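
A minimal sketch of how two WSI-level embeddings could be aligned with a contrastive objective. The abstract only says "a contrastive objective"; the symmetric InfoNCE form, the temperature value, and the names `info_nce`, `deep`, and `concept` are assumptions made for illustration, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(deep, concept, temperature=0.1):
    """Symmetric InfoNCE loss between paired embeddings (assumed loss form).

    deep[i] and concept[i] are the two embeddings of the same slide;
    every other pairing in the batch is treated as a negative.
    """
    a = [_normalize(v) for v in deep]
    b = [_normalize(v) for v in concept]
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_diagonal_cross_entropy(logits) + _mean_diagonal_cross_entropy(transposed))
```

The loss is small when each slide's deep embedding is closest to its own concept embedding and large when the pairing is shuffled.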

Authors:Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
Title: When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
Abstract:
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
Chinese: 在大语言模型测试时计算扩展中,自我一致性方法通过多数投票选择答案,而生成式奖励模型通过验证链评分,研究发现自我一致性在多数实际计算预算下效率更高,且最优推理策略更倾向于大力扩展解决方案生成。
English: Scaling test-time compute for large language models involves a trade-off between generating more solutions through Self-Consistency and using fewer solutions with Generative Reward Model verification, with findings showing Self-Consistency is more compute-efficient for most practical budgets and that optimal inference favors aggressive solution scaling.
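
The two selection strategies compared in the abstract can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the verifier scores are assumed to be numbers already produced by some verification step, and the budget function assumes every generation (solution or verification) costs the same, which is a simplification.

```python
from collections import Counter


def self_consistency(answers):
    """Self-Consistency: return the most common final answer (majority vote)."""
    return Counter(answers).most_common(1)[0][0]


def verifier_best_of_n(answers, verification_scores):
    """Verifier-based selection: average each solution's scores, keep the best.

    verification_scores[i] holds the scores of solution i, one score per
    verification chain-of-thought.
    """
    means = [sum(scores) / len(scores) for scores in verification_scores]
    best = max(range(len(answers)), key=means.__getitem__)
    return answers[best]


def budget_in_generations(num_solutions, num_verifications):
    """Total model generations, assuming equal cost per generation."""
    return num_solutions * (1 + num_verifications)
```

Under this simplified cost model, 8 solutions with 4 verifications each consume the same budget as 40 solutions with majority voting and no verification, which is the trade-off the abstract studies.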

Authors:Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou
Title: MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Abstract:
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or ``thinking paths'', which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code are available at https://github.com/UCSC-VLAA/MedReason.
Chinese: MedReason数据集通过知识图谱为32,682个临床问题生成逐步推理路径,填补了医疗推理数据的空白,经专业验证和实验证明能显著提升AI模型的诊断能力。
English: The MedReason dataset addresses the lack of transparent medical reasoning data by using a knowledge graph to generate step-by-step explanations for 32,682 clinical questions, significantly improving AI models' diagnostic accuracy through fine-tuning.

Authors:Sixu Li, Deepak Prakash Kumar, Swaroop Darbha, Yang Zhou
Title: Time-optimal Convexified Reeds-Shepp Paths on a Sphere
Abstract:
This article addresses time-optimal path planning for a vehicle capable of moving both forward and backward on a unit sphere with a unit maximum speed, and constrained by a maximum absolute turning rate $U_{max}$. The proposed formulation can be utilized for optimal attitude control of underactuated satellites, optimal motion planning for spherical rolling robots, and optimal path planning for mobile robots on spherical surfaces or uneven terrains. By utilizing Pontryagin's Maximum Principle and analyzing phase portraits, it is shown that for $U_{max}\geq1$, the optimal path connecting a given initial configuration to a desired terminal configuration falls within a sufficient list of 23 path types, each comprising at most 6 segments. These segments belong to the set $\{C,G,T\}$, where $C$ represents a tight turn with radius $r=\frac{1}{\sqrt{1+U_{max}^2}}$, $G$ represents a great circular arc, and $T$ represents a turn-in-place motion. Closed-form expressions for the angles of each path in the sufficient list are derived. The source code for solving the time-optimal path problem and visualization is publicly available at https://github.com/sixuli97/Optimal-Spherical-Convexified-Reeds-Shepp-Paths.
中文: 本研究提出了一种在单位球面上双向移动的车辆时间最优路径规划方法,针对最大转向率≥1的情况确定了23种最多包含6段{C, G, T}路径类型的充分列表,可应用于卫星控制和机器人运动规划。
English: This study presents a time-optimal path planning method for vehicles moving bidirectionally on a unit sphere, identifying 23 path types with up to 6 segments from {C, G, T} for maximum turning rates ≥1, with applications in satellite control and robotics.
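
As a small worked example of the tight-turn radius formula in the abstract, the sketch below evaluates $r=\frac{1}{\sqrt{1+U_{max}^2}}$ numerically. The function names are hypothetical. The angular radius is added using the standard relation for a small circle on a unit sphere (its Euclidean radius is the sine of its angular radius); it is not stated in the abstract.

```python
import math


def tight_turn_radius(u_max):
    """Euclidean radius of a C segment on the unit sphere: 1 / sqrt(1 + u_max^2)."""
    return 1.0 / math.sqrt(1.0 + u_max ** 2)


def tight_turn_angular_radius(u_max):
    """Angular radius (radians) of the same small circle, i.e. asin(r) = atan(1 / u_max)."""
    return math.atan2(1.0, u_max)
```

At the boundary case $U_{max}=1$ the radius is $1/\sqrt{2}$ and the angular radius is $\pi/4$; larger turning rates give tighter circles.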

Authors:Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu
Title: IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Abstract:
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The whole training dataset, code, and models in a wide range of sizes are available at https://github.com/BwLiu01/IDMR.
中文: 研究者提出了实例驱动的多模态图像检索(IDMR)新任务,要求模型在文本描述的不同场景中检索包含相同实体对象的图像,并通过跨域合成方法生成训练数据,其基于多模态大语言模型的检索方法在传统和零样本基准测试中均优于现有模型。
English: The authors introduce Instance-Driven Multimodal Image Retrieval (IDMR), a novel task requiring models to retrieve images with the same object instance in varied text-described scenarios, and develop a cross-domain synthesis method to create training data, with their MLLM-based model outperforming existing approaches on both traditional and zero-shot benchmarks.

Authors:Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Title: WikiVideo: Article Generation from Multiple Videos
Abstract:
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
中文摘要:本文提出了WikiVideo基准,用于从多视频生成维基百科式文章,并开发了协作文章生成(CAG)方法,通过推理模型与视频大模型的交互提升事件语义理解能力,其性能显著优于现有技术。
English Summary: This paper introduces WikiVideo, a benchmark for generating Wikipedia-style articles from multiple videos, and proposes Collaborative Article Generation (CAG), an interactive method that enhances high-level event understanding by combining reasoning models with VideoLLMs, outperforming existing approaches.

Authors:Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin
Title: DBF-UNet: A Two-Stage Framework for Carotid Artery Segmentation with Pseudo-Label Generation
Abstract:
Medical image analysis faces significant challenges due to limited annotation data, particularly in three-dimensional carotid artery segmentation tasks, where existing datasets exhibit spatially discontinuous slice annotations with only a small portion of expert-labeled slices in complete 3D volumetric data. To address this challenge, we propose a two-stage segmentation framework. First, we construct continuous vessel centerlines by interpolating between annotated slice centroids and propagate labels along these centerlines to generate interpolated annotations for unlabeled slices. The slices with expert annotations are used for fine-tuning SAM-Med2D, while the interpolated labels on unlabeled slices serve as prompts to guide segmentation during inference. In the second stage, we propose a novel Dense Bidirectional Feature Fusion UNet (DBF-UNet). This lightweight architecture achieves precise segmentation of complete 3D vascular structures. The network incorporates bidirectional feature fusion in the encoder and integrates multi-scale feature aggregation with dense connectivity for effective feature reuse. Experimental validation on public datasets demonstrates that our proposed method effectively addresses the sparse annotation challenge in carotid artery segmentation while achieving superior performance compared to existing approaches. The source code is available at https://github.com/Haoxuanli-Thu/DBF-UNet.
中文: 本研究提出一个两阶段框架解决颈动脉三维分割中标注稀疏的问题,通过血管中心线插值生成标注并采用新型DBF-UNet架构实现双向特征融合,在获得精确血管分割的同时展现出优越性能。
English: This study introduces a two-stage framework for 3D carotid artery segmentation that addresses sparse annotations by generating interpolated labels via vessel centerlines and employs a novel DBF-UNet architecture with bidirectional feature fusion to achieve precise vascular segmentation with superior performance.
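
A minimal sketch of the first-stage idea of building a continuous centerline from sparsely annotated slices. The abstract says only that centerlines are constructed "by interpolating between annotated slice centroids"; linear interpolation, the dictionary layout, and the name `interpolate_centerline` are assumptions made for illustration.

```python
def interpolate_centerline(annotated):
    """Linearly interpolate (x, y) centroids for every slice between annotations.

    annotated maps a slice index to the centroid of its expert label.
    Returns a dict covering every slice from the first to the last annotated one.
    """
    if not annotated:
        return {}
    keys = sorted(annotated)
    centerline = {}
    for z0, z1 in zip(keys, keys[1:]):
        (x0, y0), (x1, y1) = annotated[z0], annotated[z1]
        for z in range(z0, z1):
            t = (z - z0) / (z1 - z0)  # fraction of the way from slice z0 to z1
            centerline[z] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    centerline[keys[-1]] = annotated[keys[-1]]
    return centerline
```

The interpolated points could then serve as the locations along which labels are propagated to unlabeled slices, as the abstract describes.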

Authors:Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Abstract:
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing the previous best methods by relative margins of 52.8% on WindowsAgentArena and 16.52% on AndroidWorld. Code is available at https://github.com/simular-ai/Agent-S.
Chinese: Agent S2通过组合式框架结合混合定位技术和主动分层规划,解决了图形界面交互中的核心难题,在多项计算机使用基准测试中创下了最优性能记录。
English: Agent S2 introduces a compositional framework with Mixture-of-Grounding and Proactive Hierarchical Planning to overcome GUI interaction challenges, achieving state-of-the-art performance across multiple computer use benchmarks.

Authors:Enzhe Sun, Yongchuan Cui, Peng Liu, Jining Yan
Title: A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities
Abstract:
Remote sensing spatiotemporal fusion (STF) addresses the fundamental trade-off between temporal and spatial resolution by combining high temporal-low spatial and high spatial-low temporal imagery. This paper presents the first comprehensive survey of deep learning advances in remote sensing STF over the past decade. We establish a systematic taxonomy of deep learning architectures including Convolutional Neural Networks (CNNs), Transformers, Generative Adversarial Networks (GANs), diffusion models, and sequence models, revealing significant growth in deep learning adoption for STF tasks. Our analysis reveals that CNN-based methods dominate spatial feature extraction, while Transformer architectures show superior performance in capturing long-range temporal dependencies. GAN and diffusion models demonstrate exceptional capability in detail reconstruction, substantially outperforming traditional methods in structural similarity and spectral fidelity. Through comprehensive experiments on seven benchmark datasets comparing ten representative methods, we validate these findings and quantify the performance trade-offs between different approaches. We identify five critical challenges: time-space conflicts, limited generalization across datasets, computational efficiency for large-scale processing, multi-source heterogeneous fusion, and insufficient benchmark diversity. The survey highlights promising opportunities in foundation models, hybrid architectures, and self-supervised learning approaches that could address current limitations and enable multimodal applications. The specific models, datasets, and other information mentioned in this article have been collected in: https://github.com/yc-cui/Deep-Learning-Spatiotemporal-Fusion-Survey.
中文摘要:本文首次系统综述了深度学习在遥感时空融合领域十年来的进展,通过分析不同架构的优势并指出当前五大挑战,同时展望了基础模型和混合架构等未来发展方向。
English Summary: This survey comprehensively reviews the past decade's advances in deep learning for remote sensing spatiotemporal fusion, analyzing various architectures' strengths and identifying key challenges alongside future opportunities in the field.

Authors:Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li
Title: CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models
Abstract:
Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches, including manual rewriting, rule-based systems, and large language model (LLM)-based techniques, often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., an LLM generating non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule-based and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.
中文: CrackSQL是一种结合规则与大型语言模型的混合式SQL方言翻译系统,通过功能化查询分割和跨方言语法嵌入技术提升翻译准确性,同时支持多种部署模式以适应实际应用场景。
English: CrackSQL is a hybrid SQL dialect translation system that combines rule-based and LLM-based methods to enhance accuracy and reduce manual effort by segmenting complex queries and employing cross-dialect syntax alignment.

Authors:Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu
Title: GISE-TTT: A Framework for Global Information Segmentation and Enhancement
Abstract:
This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal horizons. To overcome this limitation, we introduce GISE-TTT, a novel architecture that integrates Temporal Transformer (TTT) layers into transformer-based frameworks through a co-designed hierarchical approach. The TTT layer systematically condenses historical temporal information into hidden states that encode globally coherent contextual representations. By leveraging multi-stage contextual aggregation through hierarchical concatenation, our framework progressively refines spatiotemporal dependencies across network layers. This design represents the first systematic empirical evidence that distributing global information across multiple network layers is critical for optimal dependency utilization in video segmentation tasks. Ablation studies demonstrate that incorporating TTT modules at high-level feature stages significantly enhances global modeling capabilities, thereby improving the network's ability to capture long-range temporal relationships. Extensive experiments on DAVIS 2017 show that GISE-TTT achieves a 3.2% improvement in segmentation accuracy over the baseline model, providing comprehensive evidence that global information should be strategically leveraged throughout the network architecture. The code will be made available at: https://github.com/uuool/GISE-TTT.
中文: 本文提出GISE-TTT架构,通过分层设计的时序Transformer层有效捕捉长视频序列中的全局时间依赖关系,在DAVIS 2017数据集上实现了3.2%的精度提升。
English: This paper introduces GISE-TTT, a novel architecture that integrates Temporal Transformer layers through hierarchical design to effectively capture global temporal dependencies in long video sequences, achieving a 3.2% accuracy improvement on DAVIS 2017.

Authors:Xiaohua Qi, Renda Li, Long Peng, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao
Title: Data-free Knowledge Distillation with Diffusion Models
Abstract:
Recently, Data-Free Knowledge Distillation (DFKD) has garnered attention because it can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot be easily applied to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. First, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models' data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Second, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique that enhances the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at https://github.com/xhqi0109/DiffDFKD.
中文:本文提出DiffDFKD,一种基于扩散模型的无数据知识蒸馏新方法,通过教师模型指导生成类训练数据并采用潜在CutMix增强技术提升多样性,实现了领先的性能。
English: This paper introduces DiffDFKD, a novel data-free knowledge distillation method using diffusion models to generate training-like data guided by teacher models and employs Latent CutMix Augmentation to enhance diversity efficiently, achieving state-of-the-art results.

Authors:Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Title: m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Abstract:
Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
中文: 测试时扩展显著提升了语言模型的医学推理能力,最佳推理标记预算约为4K,但其效果受限于医学知识不足而非仅推理深度。
English: Test-time scaling significantly enhances medical reasoning in language models, with an optimal token budget of around 4K, but its effectiveness is limited by insufficient medical knowledge rather than reasoning depth alone.

Authors:Xin Tong, Xuanhe Zhou, Bingsheng He, Guoliang Li, Zirui Tang, Wei Zhou, Fan Wu, Mian Lu, Yuqiang Chen
Title: FeatInsight: An Online ML Feature Management System on 4Paradigm Sage-Studio Platform
Abstract:
Feature management is essential for many online machine learning applications and can often become the performance bottleneck (e.g., taking up to 70% of the overall latency in a sales prediction service). Improper feature configurations (e.g., introducing too many irrelevant features) can severely undermine the model's generalization capabilities. However, managing online ML features is challenging due to (1) large-scale, complex raw data (e.g., the 2018 PHM dataset contains 17 tables and dozens to hundreds of columns), (2) the need for high-performance, consistent computation of interdependent features with complex patterns, and (3) the requirement for rapid updates and deployments to accommodate real-time data changes. In this demo, we present FeatInsight, a system that supports the entire feature lifecycle, including feature design, storage, visualization, computation, verification, and lineage management. FeatInsight (with OpenMLDB as the execution engine) has been deployed in over 100 real-world scenarios on 4Paradigm's Sage Studio platform, handling up to a trillion-dimensional feature space and enabling millisecond-level feature updates. We demonstrate how FeatInsight enhances feature design efficiency (e.g., for online product recommendation) and improves feature computation performance (e.g., for online fraud detection). The code is available at https://github.com/4paradigm/FeatInsight.
中文: 特征管理对在线机器学习至关重要却常成为性能瓶颈,而FeatInsight作为全周期特征管理系统,已在众多实际场景中部署,显著提升了特征设计效率和计算性能。
English: Feature management is critical in online machine learning but often becomes a performance bottleneck, and FeatInsight is a comprehensive system that enhances efficiency and performance across the entire feature lifecycle, deployed in numerous real-world scenarios.

Authors:Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng
Title: CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification
Abstract:
Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
Chinese: CellVTA通过将基于CNN的适配器与视觉Transformer结合,保留高分辨率空间细节,显著提升了数字病理学中的细胞实例分割性能,在基准数据集上取得了领先成果。
English: CellVTA enhances cell instance segmentation in digital pathology by integrating a CNN-based adapter with Vision Transformers to preserve high-resolution spatial details, achieving state-of-the-art results on benchmark datasets.

Authors:Hyunwoo Park, Gun Ryu, Wonjun Kim
Title: DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
Abstract:
Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often overfits to the training views, which significantly degrades the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, that makes simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows non-excluded Gaussians to have larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve competitive performance with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity. The code and model are publicly available at: https://github.com/DCVL-3D/DropGaussian release.
中文: DropGaussian是一种无先验方法,通过在训练中随机移除高斯函数来缓解稀疏视图三维高斯溅射中的过拟合问题,无需额外复杂度即可提升新视角合成的质量。
English: DropGaussian is a prior-free method that randomly removes Gaussians during training to mitigate overfitting in sparse-view 3D Gaussian splatting, enhancing novel view synthesis quality without added complexity.
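
The dropout-style removal described in the abstract can be sketched as a per-iteration random mask over Gaussians. This is an illustrative sketch, not the released code: the function name, the independent per-Gaussian drop probability, and the `drop_rate` parameter are assumptions, and any rescaling of the surviving Gaussians is deliberately left out because the abstract does not describe it.

```python
import random


def drop_gaussians(num_gaussians, drop_rate, rng):
    """Return the indices of Gaussians kept for one training iteration.

    Each Gaussian is removed independently with probability drop_rate
    (an assumed sampling scheme); only the kept ones would be rendered.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    return [i for i in range(num_gaussians) if rng.random() >= drop_rate]
```

A training loop would draw a fresh mask every iteration, for example `kept = drop_gaussians(n, 0.2, random.Random(seed))`, and render with all Gaussians at test time, following the usual dropout convention (also an assumption here).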

Authors:Lin Zhang, Zhouhong Gu, Xiaoran Shi, Hongwei Feng, Yanghua Xiao
Title: RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model
Abstract:
As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at https://github.com/MikeGu721/reckon
中文: RECKON是一种基于参考数据的大语言模型高效知识评估方法,通过将非结构化数据组织成可管理单元并生成针对性问题,在降低56.5%资源消耗的同时,在多个领域保持了超过97%的准确率。
English: RECKON is an efficient knowledge evaluation method for large language models that uses reference data to generate targeted questions, reducing resource consumption by 56.5% while maintaining over 97% accuracy across multiple domains.

Authors:Yunsoo Kim, Michal W. S. Ong, Daniel W. Rogalsky, Manuel Rodriguez-Justo, Honghan Wu, Adam P. Levine
Title: IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models
Abstract:
Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, "Gemma-2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4-O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at https://github.com/knowlab/IHC-LLMiner.
中文: 本研究开发了IHC-LLMiner自动化流程,通过微调Gemma-2模型从PubMed摘要中高效提取并标准化免疫组化-肿瘤特征图谱,实现了高精度的大规模生物医学数据挖掘,为癌症研究提供有力支持。
English: This study introduces IHC-LLMiner, an automated pipeline using fine-tuned Gemma-2 models to efficiently extract and normalize IHC-tumor profiles from PubMed abstracts, achieving high accuracy and enabling large-scale biomedical data mining for cancer research.

Authors:Thomas E. Huber, Jules Lecomte, Borislav Polovnikov, Axel von Arnim
Title: Scaling Up Resonate-and-Fire Networks for Fast Deep Learning
Abstract:
Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, is appealing for its biological plausibility and complex dynamics combined with computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. For the first time, S5-RF scales an RF network to a deep SNN with up to four layers, and it achieves a new state-of-the-art result of 78.8% for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at https://github.com/ThomasEHuber/s5-rf.
中文: 本文提出的S5-RF基于谐振发放神经元构建深度脉冲神经网络,在脉冲语音指令数据集上以更少训练时间和计算量实现了最优性能。
English: This paper introduces S5-RF, a deep spiking neural network based on resonate-and-fire neurons, which achieves state-of-the-art accuracy on the Spiking Speech Commands dataset with significantly reduced training time and computational operations.
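For readers unfamiliar with the resonate-and-fire neuron, the following sketch simulates a single neuron in its classical form: a complex state z obeying z' = (b + iω)z + I, with a spike when Im(z) reaches a threshold. It is not the S5-RF layer from the paper; the parameter values, the hard reset to zero, and the function name are illustrative assumptions.

```python
import cmath


def simulate_rf_neuron(inputs, omega, damping=-0.1, dt=0.01, threshold=1.0):
    """Simulate one resonate-and-fire neuron and return a 0/1 spike train.

    The complex state follows z' = (damping + i*omega) * z + input.
    A spike is emitted when Im(z) reaches the threshold.
    """
    # Exact one-step propagator of the linear dynamics over a step of size dt.
    decay = cmath.exp(complex(damping, omega) * dt)
    z = 0j
    spikes = []
    for current in inputs:
        z = decay * z + dt * current
        if z.imag >= threshold:
            spikes.append(1)
            z = 0j  # hard reset; an illustrative choice, not from the paper
        else:
            spikes.append(0)
    return spikes


# Hypothetical usage: a strong constant drive makes the neuron fire.
spike_train = simulate_rf_neuron([100.0] * 100, omega=10.0)
```

Because the sub-threshold dynamics are linear, they can be written as a state space recurrence, which is the connection to SSMs that the abstract draws on.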

Authors:Xiaoxuan Zhu, Zhouhong Gu, Baiqian Wu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Abstract:
Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at https://github.com/zxx000728/ToReMi.
中文: ToReMi框架通过主题关联和学习模式动态调整训练数据权重,在困惑度降低和下游任务表现上均优于传统方法。
English: The ToReMi framework dynamically adjusts training data weights based on topic associations and learning patterns, consistently outperforming conventional methods in both perplexity reduction and downstream task performance.

Authors:Anthony Yazdani, Ihor Stepanov, Douglas Teodoro
Title: GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition
Abstract:
Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types. To address these issues, we introduce GLiNER-BioMed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedicine. In contrast to conventional approaches, GLiNER uses natural language labels to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Experiments on several biomedical datasets demonstrate that GLiNER-BioMed outperforms the state-of-the-art in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline (p-value < 0.001). Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
中文: GLiNER-BioMed提出了一种针对生物医学领域优化的轻量级命名实体识别模型,通过自然语言标签实现零样本实体识别,并借助合成数据生成策略在多个数据集上以5.96%的F1分数显著超越现有最佳方法。
English: GLiNER-BioMed introduces a domain-adapted, lightweight NER model that uses natural language labels for zero-shot recognition of biomedical entities, outperforming state-of-the-art methods with a 5.96% F1-score improvement through synthetic data generation and efficient model training.

Authors:Xianghong Xu, Xiao He, Tieying Zhang, Lei Zhang, Rui Shi, Jianjun Chen
Title: PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models
Abstract:
Number of Distinct Values (NDV) estimation of a multiset/column is a basis for many data management tasks, especially within databases. Despite decades of research, most existing methods require either a significant amount of samples through uniform random sampling or access to the entire column to produce estimates, leading to substantial data access costs and potentially ineffective estimations in scenarios with limited data access. In this paper, we propose leveraging semantic information, i.e., schema, to address these challenges. The schema contains rich semantic information that can benefit the NDV estimation. To this end, we propose PLM4NDV, a learned method incorporating Pre-trained Language Models (PLMs) to extract semantic schema information for NDV estimation. Specifically, PLM4NDV leverages the semantics of the target column and the corresponding table to gain a comprehensive understanding of the column's meaning. By using the semantics, PLM4NDV reduces data access costs, provides accurate NDV estimation, and can even operate effectively without any data access. Extensive experiments on a large-scale real-world dataset demonstrate the superiority of PLM4NDV over baseline methods. Our code is available at https://github.com/bytedance/plm4ndv.
中文: 本文提出PLM4NDV方法,利用预训练语言模型提取语义模式信息进行唯一值数量估算,可降低数据访问成本,并在无需数据访问的情况下依然有效工作。
English: This paper introduces PLM4NDV, a method that utilizes pre-trained language models to extract semantic schema information for accurate Number of Distinct Values estimation, reducing data access costs and performing effectively even without data access.

Authors:Jirui Qi, Raquel Fernández, Arianna Bisazza
Title: On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple 'distracting' passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.
中文: 多语言检索增强生成使大语言模型能有效提取不同语言段落中的相关信息,但在正确语言中生成完整答案的能力较弱,尤其当存在多语言干扰段落时影响更显著。
English: Multilingual retrieval-augmented generation enables large language models to effectively extract relevant information from passages in different languages, yet they struggle to consistently produce full answers in the correct language, especially when distracted by irrelevant multilingual passages.

Authors:Owen Cook, Jake Vasilakes, Ian Roberts, Xingyi Song
Title: Efficient Annotator Reliability Assessment with EffiARA
Abstract:
Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework's efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft-label aggregation and sample weighting, and the other increasing the overall agreement among annotators by identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at https://github.com/MiniEggz/EffiARA and the webtool is publicly accessible at https://effiara.gate.ac.uk.
中文: EffiARA框架是首个支持完整文档级标注流程的综合解决方案,通过提升标注可靠性和效率得到验证,现已作为开源Python包及易用的网络工具发布。
English: The EffiARA framework is the first comprehensive solution supporting the entire document-level annotation pipeline, enhancing reliability and efficiency, as demonstrated in previous studies, and is now available as an open-source Python package with a user-friendly webtool.
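Agreement among annotators, which this abstract refers to, is commonly quantified with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators in plain Python. It does not use the EffiARA package or its API, and whether EffiARA relies on this particular statistic is not stated in the abstract; the function name and the sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical usage with made-up labels.
kappa = cohens_kappa(["pos", "neg", "pos", "neg"], ["pos", "neg", "pos", "pos"])
```

A consistently low kappa between one annotator and all of their peers is one possible signal that the annotator is unreliable.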

Authors:Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem
Title: SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Abstract:
Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details of natural videos, which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, which infuses both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available at https://github.com/fmthoker/SMILE
中文: 本文提出SMILE自监督视频学习方法,通过融合图像语言模型的空间语义和合成运动模式来增强语义表示与动态捕捉能力,在无需自然视频数据的情况下,于多个数据集上超越现有最优方法。
English: This paper introduces SMILE, a self-supervised video learning method that enhances semantic representation and motion dynamics by integrating spatial semantics from image-language models and synthetic motion patterns, outperforming existing approaches across multiple datasets without requiring natural video data.

Authors:Shuyi Zhou, Shuxiang Xie, Ryoichi Ishikawa, Takeshi Oishi
Title: Robust LiDAR-Camera Calibration with 2D Gaussian Splatting
Abstract:
LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.
中文: 本文提出一种无需标定物的激光雷达-相机系统校准方法,利用二维高斯泼溅重建几何信息,并通过光度、重投影和三角测量损失优化外参,从而提升校准的鲁棒性和精度。
English: This paper introduces a novel targetless calibration method for LiDAR-camera systems that leverages 2D Gaussian Splatting to reconstruct geometric information and optimizes extrinsic parameters through photometric, reprojection, and triangulation losses for improved accuracy and robustness.

Authors:Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Title: ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Abstract:
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
中文: 本文提出ShortV方法,无需训练即可通过识别并冻结多模态大语言模型中处理视觉令牌的无效层,在保持性能的同时显著降低计算成本,例如在LLaVA-NeXT-13B上实现50%的FLOPs减少。
English: This paper introduces ShortV, a training-free method that reduces computational costs in Multimodal Large Language Models by identifying and freezing ineffective layers during visual token processing, achieving up to 50% FLOPs reduction while maintaining performance.
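The Layer Contribution idea from this abstract (compare the model output with and without one layer's update on chosen tokens) can be sketched on a toy model. Everything here is an assumption made for illustration: the scalar token states, the `mixing_layer` stand-in for a transformer layer, the two-way softmax output, and the use of KL divergence as the divergence measure, which the abstract does not name.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mixing_layer(scale):
    """Toy layer: every token receives scale * mean(all token states)."""
    def layer(states):
        mean = sum(states) / len(states)
        return [scale * mean] * len(states)
    return layer


def run_model(tokens, layers, frozen_layer=None, frozen_tokens=frozenset()):
    """Residual stack; tokens in frozen_tokens skip the update of frozen_layer."""
    states = list(tokens)
    for idx, layer in enumerate(layers):
        updates = layer(states)
        states = [
            s if (idx == frozen_layer and i in frozen_tokens) else s + u
            for i, (s, u) in enumerate(zip(states, updates))
        ]
    return softmax([states[-1], 0.0])  # output is read from the last token


def layer_contribution(tokens, layers, layer_idx, token_ids):
    base = run_model(tokens, layers)
    ablated = run_model(tokens, layers, layer_idx, frozenset(token_ids))
    return kl_divergence(base, ablated)


# Hypothetical setup: tokens 0 and 1 play the role of visual tokens.
tokens = [1.0, 2.0, 0.5]
layers = [mixing_layer(0.5), mixing_layer(0.0), mixing_layer(0.3)]
scores = [layer_contribution(tokens, layers, i, [0, 1]) for i in range(len(layers))]
```

In this toy setup only the first layer has a nonzero score: the second layer applies no update, and nothing downstream reads the visual tokens after the third. Layers scoring near zero are the kind a method like ShortV would treat as candidates for freezing.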

Authors:Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su
Title: FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Abstract:
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
中文: 本文提出了FortisAVQA数据集,通过重构问题和引入分布偏移来解决视听问答中的过拟合与鲁棒性问题,并设计了MAVEN去偏网络,在性能上实现了7.81%的显著提升,达到最先进水平。
English: This paper introduces FortisAVQA, a novel dataset designed to address overfitting and robustness issues in Audio-Visual Question Answering by incorporating rephrased questions and distribution shifts, and proposes MAVEN, a debiasing network that achieves state-of-the-art performance with a 7.81% improvement.

Authors:Zhuohao Li, Zhicheng Huang, Wenchao Liu, Zhuxin Zhang, Jianming Miao
Title: FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images
Abstract:
Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low-level features with high-level features in dimensions. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1-shot and 5-shot scenarios in the datasets, respectively. Our implementation is available at https://github.com/lizhh268/FSSUWNet.
中文: 本文提出FSSUWNet,一种专为水下环境设计的少样本语义分割框架,通过增强特征集成与对齐来克服现有方法的局限性,在公开数据集上实现了最优性能。
English: This paper introduces FSSUWNet, a novel few-shot semantic segmentation framework designed for underwater environments that enhances feature integration and alignment to overcome the limitations of existing methods, achieving state-of-the-art performance on public datasets.

Authors:Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang
Title: 4th PVUW MeViS 3rd Place Report: Sa2VA
Abstract:
Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given a language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, making it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, for referring expression tasks, a new trend is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
中文: 通过增强多模态大语言模型的测试时推理方法,无需额外训练即可显著提升在具有挑战性的MeViS视频指代分割数据集上的性能,最终在第四届PVUW研讨会中获得第三名。
English: A simple modification to the test-time inference method on stronger multi-modal large language models significantly improves performance on the challenging MeViS dataset for referring video object segmentation, achieving third place in the 4th PVUW workshop without additional training.

Authors:Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao
Title: Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection
Abstract:
Developing trustworthy AI systems calls for attribution methods, which aim to identify the input regions that most influence a model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
中文: 本文提出LiMA这一高效黑盒归因方法,通过将输入区域重要性重构为子模优化问题,在识别影响AI决策的关键区域方面实现了更优的准确性和效率。
English: This paper introduces LiMA, an efficient black-box attribution method that reformulates input region importance as a submodular optimization problem, achieving superior accuracy and speed in identifying influential regions for AI decisions.
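Ranking regions by submodular subset selection, as described in this abstract, can be illustrated with the plain forward greedy procedure. This is the naive baseline, not LiMA's bidirectional search, and the coverage-count score is only a stand-in for the paper's submodular function; the region names and the `cover_map` data are made up.

```python
def greedy_rank(regions, score):
    """Order region ids by greedy marginal gain under a set function `score`."""
    selected = []
    remaining = set(regions)
    while remaining:
        current = score(selected)
        # Pick the region whose addition increases the score the most;
        # sorting makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda r: score(selected + [r]) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def coverage_score(cover_map):
    """Build a submodular score: number of distinct items covered by a subset."""
    def score(subset):
        covered = set()
        for region in subset:
            covered |= cover_map[region]
        return len(covered)
    return score


# Hypothetical data: each region "explains" a set of evidence items.
cover_map = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1}, "d": {5, 6}}
ranking = greedy_rank(cover_map.keys(), coverage_score(cover_map))
```

Each round evaluates the score once per remaining region, so the cost grows quadratically with the number of regions. That cost is the kind of inefficiency the abstract's faster search is meant to reduce.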

Authors:Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Title: VerifiAgent: a Unified Verification Agent in Language Model Reasoning
Abstract:
Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) among all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent
中文: VerifiAgent提出了一种结合元验证与自适应工具选择的统一验证框架,在所有推理任务中超越基线方法,同时提升准确率与效率。
English: VerifiAgent introduces a unified verification framework with meta-verification and adaptive tool selection, outperforming baseline methods across reasoning tasks while improving accuracy and efficiency.

Authors:Zilong Huang, Jun He, Junyan Ye, Lihan Jiang, Weijia Li, Yiping Chen, Ting Han
Title: Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
Abstract:
The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at https://github.com/LongHZ140516/Scene4U .
中文: 本文提出Scene4U,一种基于全景图像的分层三维场景重建框架,通过分解场景层、修复遮挡区域并采用三维高斯溅射优化,解决了视觉不连续和遮挡问题,实现了具有语义和结构一致性的沉浸式三维场景,并在性能指标上显著优于现有方法。
English: This paper introduces Scene4U, a novel layered 3D scene reconstruction framework from panoramic images that addresses visual discontinuities and occlusions by decomposing scenes into layers, restoring occluded regions, and optimizing with 3D Gaussian Splatting to achieve immersive, consistent 3D scenes with superior performance metrics.

Authors:Wanjing Zhang, Chenxing Wang
Title: Intrinsic-feature-guided 3D Object Detection
Abstract:
LiDAR-based 3D object detection is essential for autonomous driving systems. However, LiDAR point clouds are often sparse, unevenly distributed, and structurally incomplete, which significantly limits detection performance. In road driving environments, target objects such as vehicles, pedestrians, and cyclists are well suited to representation enhancement through complete template guidance, considering their grid and topological structures. Therefore, this paper presents an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module, which extracts intrinsic features from relatively generalized templates and provides rich structural information for foreground objects. Furthermore, a proposal-level contrastive learning mechanism is designed to enhance the feature differences between foreground and background objects. The proposed modules can act as plug-and-play components and improve the performance of multiple existing methods. Extensive experiments illustrate that the proposed method achieves highly competitive detection results. Code will be available at https://github.com/zhangwanjingjj/IfgNet.git.
中文: 本文提出了一种基于模板辅助特征增强的三维物体检测方法,通过提取内在特征和对比学习机制,显著提升了车辆、行人和骑车人的检测性能,在自动驾驶领域展现出强大竞争力。
English: This paper introduces a template-guided 3D object detection method that enhances feature representation for vehicles, pedestrians, and cyclists through intrinsic feature extraction and contrastive learning, achieving competitive results in autonomous driving applications.

Authors:Ting Liu, Siyuan Li
Title: Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Abstract:
Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL .
中文摘要:本文提出了一种无需训练的混合全局-局部特征提取方法和空间引导增强策略,显著提升了零样本指代图像分割中的掩码区域表示与文本对齐效果,在基准测试中表现卓越。
English Summary: This paper introduces a training-free hybrid global-local feature extraction method and a spatial guidance augmentation strategy to enhance mask region representation and alignment in zero-shot referring image segmentation, achieving superior performance on benchmarks.

Authors:Fan-Hao Lin, Tzu-Hao Huang, Chao-Kai Wen, Trung Q. Duong
Title: Geo2ComMap: Deep Learning-Based MIMO Throughput Prediction Using Geographic Data
Abstract:
Accurate communication performance prediction is crucial for wireless applications such as network deployment and resource management. Unlike conventional systems with a single transmit and receive antenna, throughput (Tput) estimation in antenna array-based multiple-input multiple-output (MIMO) systems is computationally intensive, requiring analysis of channel matrices, rank conditions, and spatial channel quality. These calculations impose significant computational and time burdens. This paper introduces Geo2ComMap, a deep learning-based framework that leverages geographic databases to efficiently estimate multiple communication metrics across an entire area in MIMO systems using only sparse measurements. To mitigate extreme prediction errors, we propose a sparse sampling strategy. Extensive evaluations demonstrate that Geo2ComMap accurately predicts full-area communication metrics, achieving a median absolute error of 27.35 Mbps for Tput values ranging from 0 to 1900 Mbps.
中文: 本文提出Geo2ComMap深度学习框架,利用地理数据和稀疏测量高效预测MIMO系统通信指标,在0至1900 Mbps吞吐量范围内实现27.35 Mbps中位绝对误差的高精度预测。
English: This paper presents Geo2ComMap, a deep learning framework that uses geographic data and sparse measurements to efficiently predict communication metrics in MIMO systems, achieving a median absolute error of 27.35 Mbps for throughput values ranging from 0 to 1900 Mbps.
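The median absolute error quoted in the abstract above is the median of the per-location absolute differences between predicted and measured throughput. A minimal sketch of that metric follows; the function name and the toy throughput values are illustrative assumptions and are not taken from the paper.

```python
from statistics import median


def median_absolute_error(measured, predicted):
    """Median of |measured - predicted| over paired samples (same units as the inputs)."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return median(abs(m - p) for m, p in zip(measured, predicted))


# Toy throughput values in Mbps (hypothetical, for illustration only).
measured_tput = [100.0, 200.0, 300.0]
predicted_tput = [110.0, 180.0, 300.0]
toy_error = median_absolute_error(measured_tput, predicted_tput)
```

Unlike the mean absolute error, the median is not pulled upward by a few extreme prediction errors, so the two are worth reporting side by side when outliers matter.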

Authors:Thomas Bailie, Yun Sing Koh, S. Karthik Mukkavilli, Varvara Vetrova
Title: Reducing Smoothness with Expressive Memory Enhanced Hierarchical Graph Neural Networks
Abstract:
Graphical forecasting models learn the structure of time series data via projecting onto a graph, with recent techniques capturing spatial-temporal associations between variables via edge weights. Hierarchical variants offer a distinct advantage by analysing the time series across multiple resolutions, making them particularly effective in tasks like global weather forecasting, where low-resolution variable interactions are significant. A critical challenge in hierarchical models is information loss during forward or backward passes through the hierarchy. We propose the Hierarchical Graph Flow (HiGFlow) network, which introduces a memory buffer variable of dynamic size to store previously seen information across variable resolutions. We theoretically show two key results: HiGFlow reduces smoothness when mapping onto new feature spaces in the hierarchy and non-strictly enhances the utility of message-passing by improving Weisfeiler-Lehman (WL) expressivity. Empirical results demonstrate that HiGFlow outperforms state-of-the-art baselines, including transformer models, by at least an average of 6.1% in MAE and 6.2% in RMSE. Code is available at https://github.com/TB862/HiGFlow.git.
Chinese: 分层图流网络通过引入动态内存缓冲区来缓解层次图形预测中的多分辨率信息损失,理论上提升了特征映射和消息传递表达能力,并在实验中以超过6%的误差指标优势超越了现有最优模型。
English: The Hierarchical Graph Flow (HiGFlow) network introduces a dynamic memory buffer to mitigate information loss across resolutions in hierarchical graphical forecasting, theoretically enhancing feature mapping and message-passing expressivity while empirically outperforming state-of-the-art models by over 6% in error metrics.

Authors:Muhammad Tahir, Shehroz S. Khan, James Davie, Soichiro Yamanaka, Ahmed Ashraf
Title: LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions
Abstract:
In mammalian and vertebrate genomes, the promoter regions of the gene and their distal enhancers may be located millions of base-pairs from each other, while a promoter may not interact with the closest enhancer. Since base-pair proximity is not a good indicator of these interactions, there is considerable work toward developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets followed by model training. However, the aforementioned random splitting causes information leakage by assigning EP pairs from the same genomic region to both testing and training sets, leading to performance overestimation. In this paper we propose to use a more thorough training and testing paradigm i.e., Leave-one-chromosome-out (LOCO) cross-validation for EPI-prediction. We demonstrate that a deep learning algorithm, which gives higher accuracies when trained and tested on random-splitting setting, drops drastically in performance under LOCO setting, confirming overestimation of performance. We further propose a novel hybrid deep neural network for EPI-prediction that fuses k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating it can learn more generalizable aspects of EP interactions. With this paper we are also releasing the LOCO splitting-based EPI dataset. Research data is available in this public repository: https://github.com/malikmtahir/EPI
中文: 在哺乳动物和脊椎动物基因组中,增强子与启动子的相互作用无法通过距离可靠预测,尽管机器学习方法常显示高准确率,但随机数据集分割会导致性能高估,本文通过留一染色体交叉验证法和一种混合深度神经网络解决了此问题,提升了模型泛化能力。
English: In mammalian and vertebrate genomes, enhancer-promoter interactions (EPI) cannot be reliably predicted from proximity. Machine learning methods often report high accuracy, but random dataset splitting leads to performance overestimation; this paper addresses the problem with Leave-one-chromosome-out (LOCO) cross-validation and a hybrid deep neural network that improves generalizability.
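The leave-one-chromosome-out idea described above can be sketched in a few lines: every enhancer-promoter pair from one chromosome is held out for testing, and the model is trained on the remaining chromosomes, so no genomic region appears on both sides of the split. This is a minimal illustration and not the authors' code; the `chrom` field name and the toy records are assumptions made for the example.

```python
from collections import defaultdict


def loco_splits(pairs):
    """Yield (held_out_chrom, train_indices, test_indices), one fold per chromosome.

    `pairs` is a list of dicts, each with a hypothetical "chrom" key naming
    the chromosome the enhancer-promoter pair lies on.
    """
    by_chrom = defaultdict(list)
    for idx, pair in enumerate(pairs):
        by_chrom[pair["chrom"]].append(idx)

    for held_out in sorted(by_chrom):
        test_idx = by_chrom[held_out]
        # Train on every pair whose chromosome differs from the held-out one.
        train_idx = sorted(
            i for chrom, idxs in by_chrom.items() if chrom != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy records (hypothetical): two pairs on chr1, one on chr2, one on chrX.
pairs = [
    {"chrom": "chr1", "label": 1},
    {"chrom": "chr1", "label": 0},
    {"chrom": "chr2", "label": 1},
    {"chrom": "chrX", "label": 0},
]
folds = list(loco_splits(pairs))
```

With a random split, the two chr1 pairs could land on opposite sides of the train/test boundary, which is the leakage the abstract describes; grouping by chromosome rules that out by construction.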

Authors:Pooya Ashtari, Shahryar Noei, Fateme Nateghi Haredasht, Jonathan H. Chen, Giuseppe Jurman, Aleksandra Pizurica, Sabine Van Huffel
Title: Deconver: A Deconvolutional Network for Medical Image Segmentation
Abstract:
While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES'22, BraTS'23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at https://github.com/pashtari/deconver.
中文: Deconver提出了一种新型U型网络,通过整合高效的非负反卷积技术替代注意力机制,在多个医学影像数据集中以降低高达90%的计算成本实现了最先进的分割性能。
English: Deconver introduces a novel U-shaped network that integrates efficient nonnegative deconvolution to replace attention mechanisms, achieving state-of-the-art segmentation performance with up to 90% lower computational costs across multiple medical imaging datasets.

Authors:Joshua Rodriguez, Om Sanan, Guillermo Vizarreta-Luna, Steven A. Conrad
Title: Text Chunking for Document Classification for Urban System Management using Large Language Models
Abstract:
Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large-language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges like resource limitations and bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents is benefited by LLMs. Our findings reveal nuanced sub-themes of LLM application suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.
中文: 本研究证明,当采用适当的提示方法(特别是通过保留局部语境的块分析)时,大型语言模型能够以与人类编码者相当的可靠性完成城市规划文档的定性编码。
English: This study demonstrates that large-language models can perform qualitative coding of urban planning documents with reliability comparable to human coders when using appropriate prompting methods, particularly through chunk-based analysis that preserves local context.

Authors:Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He
Title: SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
Abstract:
This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful non-reasoning and reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We make available our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench and project homepage at https://xyzcs.github.io/scireplicate.github.io/.
中文摘要:本研究评估大语言模型根据NLP论文算法描述生成代码的能力,提出了高难度的SciReplicate-Bench基准和双智能体框架,最佳模型执行准确率仅达39%,揭示了算法描述缺失或不一致是成功复现的主要障碍。
English Summary: This study assesses large language models' ability to generate code from algorithm descriptions in NLP papers, introducing the challenging SciReplicate-Bench benchmark and a dual-agent framework that achieved only 39% execution accuracy, revealing significant reproduction barriers.

Authors:Han Zhou, Wei Dong, Jun Chen
Title: LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors
Abstract:
Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations due to: (1) The limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) Without ground-truth references, the intensive information loss, significant noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) Combining existing exposure correction methods with 3DGS does not achieve satisfactory performance due to their individual enhancement processes, which lead to the illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop the lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively mitigate the noise within the light-invariant representation. We adopt the unsupervised strategy for the training of LITA-GS and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method while enjoying faster inference speed and costing reduced training time. The code is released at https://github.com/LowLevelAI/LITA-GS.
中文摘要:LITA-GS方法通过引入光照无关的物理先验和渐进式去噪模块,解决了三维高斯溅射在恶劣光照条件下难以生成高质量场景表示的难题,在性能和效率上均超越现有先进技术。
English Summary: The proposed LITA-GS method overcomes 3D Gaussian Splatting's limitations in adverse lighting by integrating illumination-invariant physical priors and progressive denoising, achieving superior performance and efficiency compared to existing approaches.

Authors:Suzanne Stathatos, Michael Hobley, Markus Marks, Pietro Perona
Title: SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
Abstract:
Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD
Chinese: SAVeD是一种仅使用原始噪声数据进行自监督训练的方法,通过利用前景与背景运动差异来增强低信噪比视频中的目标可见性,并以更少资源超越现有最优去噪技术。
English: SAVeD is a self-supervised method that denoises low-SNR videos using raw noisy data alone, enhancing object visibility through motion differences and outperforming state-of-the-art denoising techniques with fewer resources.

Authors:Srinitish Srinivasan, Omkumar CU
Title: Can we ease the Injectivity Bottleneck on Lorentzian Manifolds for Graph Neural Networks?
Abstract:
While hyperbolic GNNs show promise for hierarchical data, they often have limited discriminative power compared to Euclidean counterparts or the WL test, due to non-injective aggregation. To address this expressivity gap, we propose the Lorentzian Graph Isomorphic Network (LGIN), a novel HGNN designed for enhanced discrimination within the Lorentzian model. LGIN introduces a new update rule that preserves the Lorentzian metric while effectively capturing richer structural information. This marks a significant step towards more expressive GNNs on Riemannian manifolds. Extensive evaluations across nine benchmark datasets demonstrate LGIN's superior performance, consistently outperforming or matching state-of-the-art hyperbolic and Euclidean baselines, showcasing its ability to capture complex graph structures. LGIN is the first to adapt principles of powerful, highly discriminative GNN architectures to a Riemannian manifold. The code for our paper can be found at https://github.com/Deceptrax123/LGIN
中文: 提出的洛伦兹图同构网络(LGIN)通过保留洛伦兹度量的新型更新规则,增强了双曲图神经网络的判别能力,在九个基准测试中展现出优越性能,能有效捕捉复杂图结构。
English: The proposed Lorentzian Graph Isomorphic Network (LGIN) enhances hyperbolic graph neural networks' discriminative power through a novel update rule that preserves the Lorentzian metric, demonstrating superior performance across nine benchmarks by effectively capturing complex graph structures.

Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Title: Times2D: Multi-Period Decomposition and Derivative Mapping for General Time Series Forecasting
Abstract:
Time series forecasting is an important application in various domains such as energy management, traffic planning, financial markets, meteorology, and medicine. However, real-time series data often present intricate temporal variability and sharp fluctuations, which pose significant challenges for time series forecasting. Previous models that rely on 1D time series representations usually struggle with complex temporal variations. To address the limitations of 1D time series, this study introduces the Times2D method that transforms the 1D time series into 2D space. Times2D consists of three main parts: first, a Periodic Decomposition Block (PDB) that captures temporal variations within a period and between the same periods by converting the time series into a 2D tensor in the frequency domain. Second, the First and Second Derivative Heatmaps (FSDH) capture sharp changes and turning points, respectively. Finally, an Aggregation Forecasting Block (AFB) integrates the output tensors from PDB and FSDH for accurate forecasting. This 2D transformation enables the utilization of 2D convolutional operations to effectively capture long and short characteristics of the time series. Comprehensive experimental results across large-scale data in the literature demonstrate that the proposed Times2D model achieves state-of-the-art performance in both short-term and long-term forecasting. The code is available in this repository: https://github.com/Tims2D/Times2D.
中文: 本研究提出Times2D方法,通过将一维时间序列转换为二维空间,利用周期性分解和导数热图捕捉复杂时序模式,经二维卷积处理后在综合实验中实现了最先进的预测性能。
English: This study introduces Times2D, a novel method that transforms 1D time series into 2D space using periodic decomposition and derivative heatmaps to capture complex temporal patterns through 2D convolutions, achieving state-of-the-art forecasting performance in comprehensive experiments.
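The 1D-to-2D idea in the abstract above can be illustrated with a small sketch: estimate a dominant period from the frequency spectrum, fold the series into a (cycles x period) grid so that rows hold within-period variation and columns hold the same phase across periods, and take first and second differences as discrete stand-ins for the derivatives. This is a sketch under those assumptions and not the Times2D implementation; all function names are hypothetical, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def dominant_period(series):
    """Estimate the dominant period from the largest non-DC DFT magnitude."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_amp = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    return round(n / best_k)


def fold(series, period):
    """Reshape a 1D series into rows of length `period` (trailing remainder dropped)."""
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]


def difference(series):
    """Discrete first difference; applying it twice gives the second difference."""
    return [b - a for a, b in zip(series, series[1:])]


# Toy sinusoid with a known period of 8 samples.
series = [math.sin(2 * math.pi * t / 8) for t in range(32)]
period = dominant_period(series)
grid = fold(series, period)
first_diff = difference(series)
second_diff = difference(first_diff)
```

Once the series is laid out as `grid`, ordinary 2D convolutions can see both short-range structure (along a row) and long-range structure (down a column) in a single kernel window.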

Authors:J. V. S. Souza, C. B. Vieira, G. D. C. Cavalcanti, R. M. O. Cruz
Title: Imbalanced malware classification: an approach based on dynamic classifier selection
Abstract:
In recent years, the rise of cyber threats has emphasized the need for robust malware detection systems, especially on mobile devices. Malware, which targets vulnerabilities in devices and user data, represents a substantial security risk. A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications. We assess monolithic classifiers and ensemble methods, focusing on dynamic selection algorithms, which have shown superior performance compared to traditional approaches. In contrast to balancing strategies performed on the whole dataset, we propose a balancing procedure that works individually for each classifier in the pool. Our empirical analysis demonstrates that the KNOP algorithm obtained the best results using a pool of Random Forest. Additionally, an instance hardness assessment revealed that balancing reduces the difficulty of the minority class and enhances the detection of the minority class (malware). The code used for the experiments is available at https://github.com/jvss2/Machine-Learning-Empirical-Evaluation.
中文: 本研究通过评估多种机器学习策略解决安卓恶意软件检测中的类别不平衡问题,发现采用随机森林池的KNOP算法效果最佳,且平衡技术能有效提升少数类(恶意软件)的检测性能。
English: This study tackles class imbalance in Android malware detection by evaluating machine learning strategies, finding that the KNOP algorithm with a Random Forest pool achieves optimal results and that balancing techniques improve minority class detection.

Authors:Huan Zhao, Yiming Liu, Jina Yao, Ling Xiong, Zexin Zhou, Zixing Zhang
Title: Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation
Abstract:
Recent breakthroughs in single-cell technology have ushered in unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced novel obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To effectively surmount this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements: First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset: Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.
中文: 单细胞技术的最新进展带来了大规模长尾数据标注的挑战,Celler通过引入高斯膨胀损失函数和硬数据挖掘策略的生成预训练模型有效应对,并辅以全面的Celler-75数据集支持。
English: Recent advances in single-cell technology present challenges in annotating large-scale, long-tailed data, which Celler addresses through its generative pre-training model featuring Gaussian Inflation Loss and Hard Data Mining strategy, supported by the extensive Celler-75 dataset.

Authors:Yuyao Zhang, Jinghao Li, Yu-Wing Tai
Title: LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Abstract:
Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
中文: LayerCraft是一个利用大语言模型作为自主代理的模块化框架,通过思维链推理实现结构化、分层的图像生成与编辑,无需重新训练现有模型即可直观控制空间构图和对象一致性。
English: LayerCraft is a modular framework employing large language models as autonomous agents to enable structured, layered image generation and editing, offering intuitive control over spatial composition and object consistency through chain-of-thought reasoning and seamless integration without retraining existing models.

Authors:Bingxiang He, Wenbin Zhang, Jiaxi Song, Cheng Qian, Zixuan Fu, Bowen Sun, Ning Ding, Haiwen Hong, Longtao Huang, Hui Xue, Ganqu Cui, Wanxiang Che, Zhiyuan Liu, Maosong Sun
Title: AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset
Abstract:
Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference \textbf{A}nnotations, \textbf{I}nstructions, and \textbf{R}esponse Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we propose \textbf{AIR}, a component-wise analysis framework that systematically isolates and optimizes each component while evaluating their synergistic effects. Through rigorous experimentation, AIR reveals actionable principles: annotation simplicity (point-wise generative scoring), instruction inference stability (variance-based filtering across LLMs), and response pair quality (moderate margins + high absolute scores). When combined, these principles yield +5.3 average gains over the baseline method, even with only 14k high-quality pairs. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization, offering a blueprint for efficient, reproducible alignment.
中文摘要:AIR框架通过系统性地分离和优化偏好学习的三个核心要素——标注、指令和回答对,揭示了可操作的优化原则,实现了性能显著提升,并将数据集设计从随意扩展转向了组件感知的优化路径。
English Summary: The AIR framework systematically isolates and optimizes the three core components of preference learning—annotations, instructions, and response pairs—revealing actionable principles that yield significant performance gains and shifting dataset design from ad hoc scaling to component-aware optimization.

Authors:Yunhao Li, Sijing Wu, Wei Sun, Zhichao Zhang, Yucheng Zhu, Zicheng Zhang, Huiyu Duan, Xiongkuo Min, Guangtao Zhai
Title: AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images
Abstract:
The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, and visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor showcases state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.
中文: 该研究推出了首个针对AI生成人像质量评估的大规模基准AGHI-QA,并提出新型评估指标AGHI-Assessor,通过结合多模态模型与人体特征,在质量评估和失真检测方面显著优于现有方法。
English: The study introduces AGHI-QA, the first large-scale benchmark for assessing AI-generated human images, and proposes AGHI-Assessor, a novel metric that combines multimodal models with human features to outperform existing methods in quality evaluation and distortion detection.

Authors:Woo Yi Yang, Jiarui Wang, Sijing Wu, Huiyu Duan, Yuxin Zhu, Liu Yang, Kang Fu, Guangtao Zhai, Xiongkuo Min
Title: LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs
Abstract:
The rapid advancement in generative artificial intelligence has enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and the LMME3DHF will be released upon publication.
中文: 该研究提出了用于评估AI生成3D人脸的Gen3DHF基准和LMME3DHF多模态模型,该模型在预测质量分数和识别失真方面表现优异,同时与人类感知保持一致。
English: The study introduces Gen3DHF, a benchmark for evaluating AI-generated 3D human faces, and proposes LMME3DHF, a multimodal model that excels in predicting quality scores and identifying distortions while aligning with human perception.

Authors:Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, Xiongkuo Min
Title: EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment
Abstract:
The proliferation of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the emotions evoked by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the way for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
中文摘要:EEmo-Bench基准通过多样化任务和情感属性系统评估多模态大语言模型理解图像诱发情感的能力,结果表明尽管部分模型表现良好,但现有模型在情感分析维度上仍存在明显不足。
English Summary: The EEmo-Bench benchmark is introduced to systematically evaluate multimodal large language models' capabilities in understanding image-evoked emotions through diverse tasks and emotional attributes, revealing current limitations despite some promising performances.

Authors:Zeyu Zhang, Zijian Chen, Zicheng Zhang, Yuze Sun, Yuan Tian, Ziheng Jia, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Title: PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving
Abstract:
Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, leading to fixed complexity constraints and substantial data contamination issues. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, leading to reliability and reproducibility issues. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which ensures that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies, visual recognition, logical reasoning, and context understanding. PuzzleBench differs from static benchmarks that quickly become outdated. It enables ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing seamless adaptation to the evolving capabilities of LMMs.
中文: 提出的开放式视觉谜题生成(OVPG)框架通过谜题解答任务自动创建动态可验证的评估数据,解决了静态基准的局限性,并通过PuzzleBench的可扩展设计实现了对大型多模态模型核心能力的持续评估。
English: The proposed Open-ended Visual Puzzle Generation (OVPG) framework automatically creates dynamic and verifiable evaluation data through puzzle-solving tasks, addressing limitations of static benchmarks and enabling continuous assessment of Large Multimodal Models' core competencies via PuzzleBench's scalable design.

Authors:Jiaying Qian, Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Guangtao Zhai, Xiongkuo Min
Title: Towards Explainable Partial-AIGC Image Quality Assessment
Abstract:
The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs), i.e., images with localized AI-driven edits, an almost unexplored field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.
中文摘要:本研究针对局部AI编辑图像质量评估的空白,首次构建了大规模数据集并开发了可解释的质量评估模型,通过基于大语言模型的三阶段训练范式实现了对局部编辑区域的定位、量化评分和质量解释。
English Summary: This research introduces the first large-scale dataset and explainable quality assessment model for partially AI-generated images, addressing a critical gap in evaluating localized AI edits through a novel three-stage training approach using large multi-modal models.

Authors:Kaiwei Zhang, Dandan Zhu, Xiongkuo Min, Guangtao Zhai
Title: Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes
Abstract:
Mesh saliency enhances the adaptability of 3D vision by identifying and emphasizing regions that naturally attract visual attention. To investigate the interaction between geometric structure and texture in shaping visual attention, we establish a comprehensive mesh saliency dataset, which is the first to systematically capture the differences in saliency distribution under both textured and non-textured visual conditions. Furthermore, we introduce mesh Mamba, a unified saliency prediction model based on a state space model (SSM), designed to adapt across various mesh types. Mesh Mamba effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. More importantly, by subgraph embedding and a bidirectional SSM, the model enables global context modeling for both local geometry and texture, preserving the topological structure and improving the understanding of visual details and structural complexity. Through extensive theoretical and empirical validation, our model not only improves performance across various mesh types but also demonstrates high scalability and versatility, particularly through cross validations of various visual features.
中文摘要:网格显著性通过识别视觉关注区域增强3D视觉适应性,而提出的mesh Mamba模型基于状态空间架构,通过子图嵌入和双向处理实现几何结构与纹理特征的统一建模,显著提升了各类网格的显著性预测性能。
English Summary: Mesh saliency improves 3D vision adaptability by highlighting visually significant regions, and the proposed mesh Mamba model effectively integrates geometric and texture features through state space modeling to enhance saliency prediction across diverse mesh types.

Authors:Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Dusit Niyato, Jiawen Kang, Zehui Xiong, Bo Qian, Haibo Zhou, Shiwen Mao, Abbas Jamalipour, Xuemin Shen, Dong In Kim
Title: Decentralization of Generative AI via Mixture of Experts for Wireless Networks: A Comprehensive Survey
Abstract:
Mixture of Experts (MoE) has emerged as a promising paradigm for scaling model capacity while preserving computational efficiency, particularly in large-scale machine learning architectures such as large language models (LLMs). Recent advances in MoE have facilitated its adoption in wireless networks to address the increasing complexity and heterogeneity of modern communication systems. This paper presents a comprehensive survey of the MoE framework in wireless networks, highlighting its potential in optimizing resource efficiency, improving scalability, and enhancing adaptability across diverse network tasks. We first introduce the fundamental concepts of MoE, including various gating mechanisms and the integration with generative AI (GenAI) and reinforcement learning (RL). Subsequently, we discuss the extensive applications of MoE across critical wireless communication scenarios, such as vehicular networks, unmanned aerial vehicles (UAVs), satellite communications, heterogeneous networks, integrated sensing and communication (ISAC), and mobile edge networks. Furthermore, key applications in channel prediction, physical layer signal processing, radio resource management, network optimization, and security are thoroughly examined. Additionally, we present a detailed overview of open-source datasets that are widely used in MoE-based models to support diverse machine learning tasks. Finally, this survey identifies crucial future research directions for MoE, emphasizing the importance of advanced training techniques, resource-aware gating strategies, and deeper integration with emerging 6G technologies.
中文: 本文全面综述了专家混合模型在无线网络中的应用,重点探讨了其在提升资源效率、可扩展性和适应性方面的潜力,并指出了与6G技术深度融合的未来研究方向。
English: This survey comprehensively examines the Mixture of Experts (MoE) framework's applications in wireless networks, highlighting its role in enhancing resource efficiency, scalability, and adaptability across diverse communication scenarios while identifying future research directions for integration with 6G technologies.
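The gating mechanisms mentioned in the abstract can be illustrated with a minimal sketch of top-k softmax gating, one common MoE routing scheme. This is not taken from the survey: the function names `top_k_gate` and `moe_forward`, the toy experts, and the choice of top-k routing are assumptions made purely for illustration.

```python
import math

def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax-normalise them.

    Returns a dict {expert_index: weight}; all other experts get no weight.
    """
    # Indices of the k highest-scoring experts (ties keep original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy experts: plain scalar functions standing in for networks.
experts = [lambda x: x, lambda x: 100 * x, lambda x: 2 * x]
output = moe_forward(3.0, experts, gate_logits=[2.0, 0.0, 2.0], k=2)
```

Only the k selected experts are evaluated, which is where the computational saving of sparse MoE comes from; in a real system the gate logits would be produced by a learned router rather than passed in by hand.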

Authors:Jinbo Wen, Jiawen Kang, Yang Zhang, Yue Zhong, Dusit Niyato, Jie Xu, Jianhang Tang, Chau Yuen
Title: Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses
Abstract:
Mobile metaverses, envisioned as the next-generation Internet, have attracted significant attention from both academia and industry, providing users with immersive and ubiquitous metaverse services through mobile devices. Driven by Large Language Models (LLMs) and Vision-Language Models (VLMs), Artificial Intelligence (AI) agents hold the potential to empower the creation, maintenance, and evolution of mobile metaverses. Currently, AI agents are primarily constructed using cloud-based LLMs and VLMs. However, several challenges hinder their effective implementation, including high service latency and potential sensitive data leakage during perception and processing. In this paper, we develop an edge-cloud collaboration-based federated AI agent construction framework in mobile metaverses. Specifically, Edge Servers (ESs), acting as agent infrastructures, collaboratively create agent modules in a distributed manner. The cloud server then integrates these modules into AI agents and deploys them at the edge, thereby enabling low-latency AI agent services for users. Considering that ESs may exhibit dynamic levels of willingness to participate in federated AI agent construction, we design a two-period dynamic contract model to continuously motivate ESs to participate in agent module creation, effectively addressing the dynamic information asymmetry between the cloud server and the ESs. Furthermore, we propose an Enhanced Diffusion Model-based Soft Actor-Critic (EDMSAC) algorithm to efficiently generate optimal dynamic contracts, in which dynamic structured pruning is applied to DM-based actor networks to enhance denoising efficiency and policy learning performance. Extensive simulations demonstrate the effectiveness and superiority of the EDMSAC algorithm and the proposed contract model.
中文: 本文提出了一种基于边云协同的移动元宇宙联邦AI智能体构建框架,通过动态契约激励边缘服务器参与,并采用增强算法优化性能,有效解决了高延迟和数据安全等挑战。
English: This paper introduces an edge-cloud collaboration framework for constructing federated AI agents in mobile metaverses, addressing challenges like high latency and data security by using dynamic contracts to motivate edge servers and an enhanced algorithm for optimal performance.

Authors:Minrui Xu, Dusit Niyato, Jiawen Kang, Zehui Xiong, Mingzhe Chen, Dong In Kim, Xuemin Shen
Title: Hybrid Reinforcement Learning-based Sustainable Multi-User Computation Offloading for Mobile Edge-Quantum Computing
Abstract:
Exploiting quantum computing at the mobile edge holds immense potential for facilitating large-scale network design, processing multimodal data, optimizing resource management, and enhancing network security. In this paper, we propose a pioneering paradigm of mobile edge quantum computing (MEQC) that integrates quantum computing capabilities into classical edge computing servers that are proximate to mobile devices. To conceptualize the MEQC, we first design an MEQC system, where mobile devices can offload classical and quantum computation tasks to edge servers equipped with classical and quantum computers. We then formulate the hybrid classical-quantum computation offloading problem whose goal is to minimize system cost in terms of latency and energy consumption. To solve the offloading problem efficiently, we propose a hybrid discrete-continuous multi-agent reinforcement learning algorithm to learn long-term sustainable offloading and partitioning strategies. Finally, numerical results demonstrate that the proposed algorithm can reduce the MEQC system cost by up to 30% compared to existing baselines.
中文摘要:本文提出移动边缘量子计算(MEQC)新范式,通过将量子计算能力集成至邻近移动设备的边缘服务器,并采用混合离散-连续多智能体强化学习算法优化任务卸载策略,可降低系统成本高达30%。
English Summary: This paper introduces mobile edge quantum computing (MEQC), a novel paradigm that integrates quantum capabilities into edge servers to enable efficient offloading of classical and quantum tasks, and proposes a reinforcement learning algorithm that reduces system costs by up to 30%.

Authors:Ziqing Fan, Siyuan Du, Shengchao Hu, Pingjie Wang, Li Shen, Ya Zhang, Dacheng Tao, Yanfeng Wang
Title: Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
Abstract:
Selecting high-quality pre-training data for large language models (LLMs) is crucial for enhancing their overall performance under a limited computation budget, improving both training and sample efficiency. Recent advancements in file selection primarily rely on using an existing or trained proxy model to assess the similarity of samples to a target domain, such as the high-quality sources BookCorpus and Wikipedia. However, upon revisiting these methods, we find that the domain-similarity selection criteria exhibit a diversity dilemma, i.e., dimensional collapse in the feature space, improving performance on domain-related tasks but causing severe degradation in generic performance. To prevent collapse and enhance diversity, we propose a DiverSified File selection algorithm (DiSF), which selects the most decorrelated text files in the feature space. We approach this with a classical greedy algorithm to achieve more uniform eigenvalues in the feature covariance matrix of the selected texts, analyzing its approximation to the optimal solution under a $\gamma$-weakly submodular optimization formulation. Empirically, we establish a benchmark and conduct extensive experiments on the TinyLlama architecture with models from 120M to 1.1B parameters. Evaluating across nine tasks from the Harness framework, DiSF demonstrates a significant improvement in overall performance. Specifically, DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget, and achieving about 1.5x training efficiency and 5x data efficiency.
中文: DiSF是一种多样化的文件选择算法,通过选取特征空间中不相关的文本文件来解决大型语言模型预训练数据中的多样性困境,显著提升了多项任务的整体性能和效率。
English: DiSF is a diversified file selection algorithm that addresses the diversity dilemma in pre-training data for large language models by selecting decorrelated text files, significantly improving overall performance and efficiency across multiple tasks.
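The greedy, decorrelation-driven selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the uncentered second-moment matrix of unit-normalised features as a stand-in for the feature covariance matrix, and its squared Frobenius norm as the objective. Because that matrix has trace 1, minimising the sum of its squared eigenvalues pushes the eigenvalues toward uniformity. The names `greedy_select` and `second_moment`, the arbitrary seeding with the first file, and the toy features are hypothetical.

```python
import math

def normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def second_moment(vectors):
    """Average outer product of the given vectors (uncentered covariance proxy)."""
    d = len(vectors[0])
    m = [[0.0] * d for _ in range(d)]
    for v in vectors:
        for i in range(d):
            for j in range(d):
                m[i][j] += v[i] * v[j] / len(vectors)
    return m

def frobenius_sq(m):
    """Squared Frobenius norm, equal to the sum of squared eigenvalues
    for a symmetric matrix."""
    return sum(x * x for row in m for x in row)

def greedy_select(features, budget):
    """Greedily add the file whose inclusion keeps the objective smallest."""
    feats = [normalize(f) for f in features]
    selected = [0]  # assumption: seed with the first file (arbitrary choice)
    while len(selected) < budget:
        best, best_score = None, float("inf")
        for i in range(len(feats)):
            if i in selected:
                continue
            trial = [feats[j] for j in selected + [i]]
            score = frobenius_sq(second_moment(trial))
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy features: files 0, 1 and 3 are nearly collinear, file 2 is orthogonal.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 0.05]]
chosen = greedy_select(features, budget=2)
```

In this toy case the orthogonal file is preferred over the near-duplicates, which is the behaviour a decorrelation criterion is meant to produce. The sketch recomputes the full matrix for each candidate and so scales poorly; it is meant only to show the shape of the objective.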

Authors:Jiahua Lan, Sen Zhang, Haixia Pan, Ruijun Liu, Li Shen, Dacheng Tao
Title: Neuron-level Balance between Stability and Plasticity in Deep Reinforcement Learning
Abstract:
In contrast to the human ability to continuously acquire knowledge, agents struggle with the stability-plasticity dilemma in deep reinforcement learning (DRL), which refers to the trade-off between retaining existing skills (stability) and learning new knowledge (plasticity). Current methods focus on balancing these two aspects at the network level, lacking sufficient differentiation and fine-grained control of individual neurons. To overcome this limitation, we propose the Neuron-level Balance between Stability and Plasticity (NBSP) method, taking inspiration from the observation that specific neurons are strongly tied to task-relevant skills. Specifically, NBSP first (1) defines and identifies RL skill neurons that are crucial for knowledge retention through a goal-oriented method, and then (2) introduces a framework that employs gradient masking and experience replay techniques targeting these neurons to preserve the encoded existing skills while enabling adaptation to new tasks. Numerous experimental results on the Meta-World and Atari benchmarks demonstrate that NBSP significantly outperforms existing approaches in balancing stability and plasticity.
中文摘要:本研究提出的神经元级稳定性与可塑性平衡方法(NBSP)通过识别任务关键神经元并采用针对性保护技术,解决了深度强化学习中的稳定性-可塑性困境,在基准测试中显著优于现有方法。
English Summary: The proposed Neuron-level Balance between Stability and Plasticity (NBSP) method addresses the stability-plasticity dilemma in deep reinforcement learning by identifying task-critical neurons and applying targeted preservation techniques, achieving superior performance on benchmark tests compared to current approaches.
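The gradient-masking idea from the abstract can be sketched as a single masked SGD step. This is an illustrative sketch only: how NBSP actually scores skill neurons and shapes its mask is not stated in the abstract, so the function name `masked_sgd_step`, the set `skill_neurons`, and the `keep` attenuation factor are all assumptions.

```python
def masked_sgd_step(weights, grads, skill_neurons, lr=0.1, keep=0.0):
    """One SGD step in which updates to designated neurons are attenuated.

    weights, grads: list of rows, one row of incoming weights per neuron.
    skill_neurons:  set of neuron indices to protect (hypothetical input;
                    in practice these would come from a scoring procedure).
    keep:           fraction of the gradient allowed through for protected
                    neurons (0.0 freezes them completely).
    """
    updated = []
    for idx, (w_row, g_row) in enumerate(zip(weights, grads)):
        scale = keep if idx in skill_neurons else 1.0
        updated.append([w - lr * scale * g for w, g in zip(w_row, g_row)])
    return updated

weights = [[1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 1.0], [1.0, 1.0]]
new_weights = masked_sgd_step(weights, grads, skill_neurons={0}, lr=0.5)
```

Neuron 0 keeps its weights while neuron 1 is updated normally, so the protected neuron retains what it encodes (stability) and the rest of the layer stays free to adapt (plasticity).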

Authors:Xiaolei Wang, Chunxuan Xia, Junyi Li, Fanzhe Meng, Lei Huang, Jinpeng Wang, Wayne Xin Zhao, Ji-Rong Wen
Title: Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User
Abstract:
Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations. A fundamental challenge in CRSs lies in effectively understanding user preferences from conversations. User preferences can be multifaceted and complex, posing significant challenges for accurate recommendations even with access to abundant external knowledge. While interaction with users can clarify their true preferences, frequent user involvement can lead to a degraded user experience. To address this problem, we propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs. The simulated user provides feedback on the items recommended by CRSs, enabling them to better capture intricate user preferences through multi-turn interaction. Inspired by generative reward models, we design two types of feedback actions for the simulated user: generative item scoring, which offers coarse-grained feedback, and attribute-based item critique, which provides fine-grained feedback. To ensure seamless integration, these feedback actions are unified into an instruction-based format, allowing the development of a unified simulated user via instruction tuning on synthesized data. With this simulated user, automatic multi-turn interaction with CRSs can be effectively conducted. Furthermore, to strike a balance between effectiveness and efficiency, we draw inspiration from the paradigm of reward-guided search in complex reasoning tasks and employ beam search for the interaction process. On top of this, we propose an efficient candidate ranking method to improve the recommendation results derived from interaction. Extensive experiments on public datasets demonstrate the effectiveness, efficiency, and transferability of our approach.
中文: 该摘要提出GRSU,一种基于生成奖励模型的模拟用户,通过自动化多轮交互提供粗粒度与细粒度反馈,在减少用户频繁参与的同时提升对话推荐系统对复杂偏好的捕捉能力。
English: The abstract introduces GRSU, a generative reward model-based simulated user that enhances conversational recommendation systems by providing both coarse-grained and fine-grained feedback through automated multi-turn interactions, improving preference capture without frequent user involvement.
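The use of beam search over the interaction process can be pictured with a generic, reward-guided beam search. This sketch does not reproduce GRSU: the callables `expand` (proposing next interaction states) and `score` (standing in for the simulated user's reward) are hypothetical placeholders, and the toy state space below is purely illustrative.

```python
def beam_search(initial, expand, score, beam_width, steps):
    """Keep the `beam_width` highest-scoring states at every step.

    expand(state) -> iterable of successor states (hypothetical callable).
    score(state)  -> float reward, higher is better (hypothetical callable).
    Returns the best state in the final beam.
    """
    beam = [initial]
    for _ in range(steps):
        candidates = [nxt for state in beam for nxt in expand(state)]
        if not candidates:
            break  # nothing left to explore
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

# Toy setting: a state is a tuple of chosen actions, the reward is their sum.
best = beam_search(
    initial=(),
    expand=lambda s: [s + (a,) for a in (0, 1, 2)],
    score=sum,
    beam_width=2,
    steps=3,
)
```

A wider beam explores more interaction trajectories at a higher cost, which is one way to read the effectiveness-efficiency balance the abstract mentions.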

Authors:Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
Abstract:
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves models' performance within the training length, e.g., improving Llama3.1 70B by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
中文: 本文提出维度感知位置嵌入操控(DPE)方法,无需训练即可通过优化RoPE嵌入中的关键维度来扩展大语言模型的上下文窗口,在多项基准测试中超越现有技术,使Llama3等模型无需重新训练就能处理128k标记的长文本。
English: This paper introduces Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free method that extends LLMs' context window by optimizing key dimensions in RoPE embeddings, achieving superior performance over existing techniques and enabling models like Llama3 to handle 128k tokens without retraining.

Authors:Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang
Title: Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Abstract:
Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and "LLM-as-a-judge" scoring. Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Because this field is evolving rapidly, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.
中文: 本综述探讨大语言模型评估的核心挑战,聚焦从任务导向转向能力导向、从人工评估转向自动化评估两大转变,并剖析模型能力无限扩展下评估泛化不足的根本问题。
English: This survey examines the core challenges of evaluating Large Language Models, highlighting two key transitions toward capability-based and automated assessment while addressing the persistent issue of evaluation generalization as model abilities rapidly expand.

Authors:Xiaowei Yuan, Zhao Yang, Ziyang Huang, Yequan Wang, Siqi Fan, Yiming Ju, Jun Zhao, Kang Liu
Title: Exploiting Contextual Knowledge in LLMs through V-usable Information based Layer Enhancement
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithful generation that properly reflects contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs' internal states. As a result, LLMs remain limited in their ability to fully leverage contextual knowledge. In this paper, we propose Context-aware Layer Enhancement (CaLE), a novel intervention method that enhances the utilization of contextual knowledge within LLMs' internal representations. By employing V-usable information analysis, CaLE strategically amplifies the growth of contextual information at an optimal layer, thereby enriching representations in the final layer. Our experiments demonstrate that CaLE effectively improves context-faithful generation in Question-Answering tasks, particularly in scenarios involving unknown or conflicting contextual knowledge.
Chinese: 大型语言模型常难以生成忠实于上下文的输出,而本文提出的上下文感知层增强(CaLE)方法通过策略性地在最优层放大信息,提升了模型利用上下文知识的能力,从而在问答任务中表现更优。
English: Large Language Models often fail to generate context-faithful outputs, but the proposed Context-aware Layer Enhancement (CaLE) method improves their ability to utilize contextual knowledge by strategically amplifying information at an optimal layer, leading to better performance in Question-Answering tasks.

Authors:Jiajun Shen, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Title: Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation
Abstract:
While hallucinations of large language models can be alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce the Context-Prior Augmented Citation Generation task, which requires models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves better cross-scenario performance than other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on the trustworthiness in citation generation.
Chinese: 本研究提出上下文先验增强引证生成任务及RAEL框架与INTRALIGN方法,通过融合外部与内部知识提升引证可信度,实验表明其跨场景性能优越,并揭示了检索质量、问题类型和模型知识对引证可靠性的重要影响。
English: This study introduces the Context-Prior Augmented Citation Generation task and proposes the RAEL paradigm with INTRALIGN method, which enhances citation trustworthiness by integrating external and internal knowledge, demonstrating superior cross-scenario performance and identifying key factors affecting citation reliability.

Authors:Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay
Abstract:
Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model's ability to revisit and re-explore these ideas later. Consequently, although the LLM's capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.
中文摘要:提出的回顾性回放强化学习(RRL)算法通过让模型重新探索早期有潜力的解决方案,有效解决了大语言模型训练中的探索效率问题,显著提升了复杂推理任务的表现,同时增强了模型的安全性和实用性。
English Summary: The proposed Retrospective Replay-based Reinforcement Learning (RRL) algorithm addresses exploration inefficiency in LLM training by enabling models to revisit promising early-stage solution ideas, significantly enhancing performance on complex reasoning tasks while improving safety and helpfulness.

Authors:Pancheng Zhao, Deng-Ping Fan, Shupeng Cheng, Salman Khan, Fahad Shahbaz Khan, David Clifton, Peng Xu, Jufeng Yang
Title: Deep Learning in Concealed Dense Prediction
Abstract:
Deep learning is developing rapidly and handling common computer vision tasks well. It is time to pay attention to more complex vision tasks, as model size, knowledge, and reasoning capabilities continue to improve. In this paper, we introduce and review a family of complex tasks, termed Concealed Dense Prediction (CDP), which has great value in agriculture, industry, etc. CDP's intrinsic trait is that the targets are concealed in their surroundings, thus fully perceiving them requires fine-grained representations, prior knowledge, auxiliary reasoning, etc. The contributions of this review are three-fold: (i) We introduce the scope, characteristics, and challenges specific to CDP tasks and emphasize their essential differences from generic vision tasks. (ii) We develop a taxonomy based on concealment counteracting to summarize deep learning efforts in CDP through experiments on three tasks. We compare 25 state-of-the-art methods across 12 widely used concealed datasets. (iii) We discuss the potential applications of CDP in the large model era and summarize 6 potential research directions. We offer perspectives for the future development of CDP by constructing a large-scale multimodal instruction fine-tuning dataset, CvpINST, and a concealed visual perception agent, CvpAgent.
中文: 本文介绍了隐蔽密集预测(CDP)这一复杂视觉任务,它需要细粒度感知和推理来检测隐藏于环境中的目标,并综述了其挑战、分类方法及在农业和工业等领域的应用前景。
English: This paper introduces Concealed Dense Prediction (CDP), a complex vision task requiring fine-grained perception and reasoning to detect targets hidden in their surroundings, and reviews its challenges, taxonomy, and future directions with applications in agriculture and industry.

Authors:Junjie Zhang, Beichen Zhang, Wenqi Sun, Hongyu Lu, Wayne Xin Zhao, Yu Chen, Ji-Rong Wen
Title: Slow Thinking for Sequential Recommendation
Abstract:
To develop effective sequential recommender systems, numerous methods have been proposed to model historical user behaviors. Despite their effectiveness, these methods share the same fast thinking paradigm. That is, for making recommendations, these methods typically encode user historical interactions to obtain user representations and directly match these representations with candidate item representations. However, due to the limited capacity of traditional lightweight recommendation models, this one-step inference paradigm often leads to suboptimal performance. To tackle this issue, we present a novel slow thinking recommendation model, named STREAM-Rec. Our approach is capable of analyzing historical user behavior, generating a multi-step, deliberative reasoning process, and ultimately delivering personalized recommendations. In particular, we focus on two key challenges: (1) identifying the suitable reasoning patterns in recommender systems, and (2) exploring how to effectively stimulate the reasoning capabilities of traditional recommenders. To this end, we introduce a three-stage training framework. In the first stage, the model is pretrained on large-scale user behavior data to learn behavior patterns and capture long-range dependencies. In the second stage, we design an iterative inference algorithm to annotate suitable reasoning traces by progressively refining the model predictions. This annotated data is then used to fine-tune the model. Finally, in the third stage, we apply reinforcement learning to further enhance the model generalization ability. Extensive experiments validate the effectiveness of our proposed method.
中文: 该摘要介绍了STREAM-Rec这一慢思考推荐模型,它通过多步推理过程和三阶段训练框架克服了传统快速思考方法的局限,从而提升了个性化推荐的效果。
English: The abstract introduces STREAM-Rec, a slow thinking recommendation model that overcomes the limitations of traditional fast-thinking methods by employing a multi-step reasoning process and a three-stage training framework to improve personalized recommendations.

Authors:Bowen Zheng, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ji-Rong Wen
Title: Universal Item Tokenization for Transferable Generative Recommendation
Abstract:
Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the tokenizer and the recommender are typically domain-specific, limiting their ability for effective transfer or adaptation to new domains. To this end, we propose UTGRec, a Universal item Tokenization approach for transferable Generative Recommendation. Specifically, we design a universal item tokenizer for encoding rich item semantics by adapting a multimodal large language model (MLLM). By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. To effectively learn the universal item tokenizer on multiple domains, we introduce two key techniques in our approach. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations to capture general knowledge embedded in the content. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction. Finally, we present a joint learning framework to pre-train and adapt the transferable generative recommender across multiple domains. Extensive experiments on four public datasets demonstrate the superiority of UTGRec compared to both traditional and generative recommendation baselines.
中文摘要:UTGRec提出了一种通用项目标记方法,通过多模态大语言模型和树形码本实现可迁移的生成式推荐,结合内容重建与协同知识,在跨领域推荐中展现出卓越性能。
English Summary: UTGRec introduces a universal item tokenization method using multimodal large language models and tree-structured codebooks to enable transferable generative recommendation, integrating content reconstruction and collaborative knowledge for superior cross-domain performance.

Authors:Bowen Zheng, Enze Liu, Zhongfu Chen, Zhongrui Ma, Yue Wang, Wayne Xin Zhao, Ji-Rong Wen
Title: Pre-training Generative Recommender with Multi-Identifier Item Tokenization
Abstract:
Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.
中文: MTGRec通过多标识符项目标记化和课程预训练方案,解决了生成式推荐中语义建模不足和数据多样性受限的问题,在效果和可扩展性上显著优于现有方法。
English: MTGRec introduces multi-identifier item tokenization and curriculum pre-training to enhance generative recommendation by addressing semantic modeling and data diversity limitations, significantly outperforming existing methods in effectiveness and scalability.
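
The RQ-VAE backbone named in the abstract builds on residual quantization: each level picks the codeword nearest to what the previous levels left unexplained, and the resulting index sequence serves as the item identifier. The sketch below shows only that quantization step. The 2-D codebooks are made-up values, and `CODEBOOKS_A` and `CODEBOOKS_B` merely stand in for two tokenizer checkpoints from adjacent epochs, which is an assumption for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_index(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Hypothetical two-level codebooks for one tokenizer checkpoint.
CODEBOOKS_A = [
    [(0.0, 0.0), (1.0, 1.0)],     # level 1: coarse codewords
    [(0.0, 0.0), (0.25, -0.25)],  # level 2: fine corrections
]
# A second, slightly different checkpoint (also hypothetical).
CODEBOOKS_B = [
    [(0.0, 0.0), (1.5, 0.5)],
    [(-0.25, 0.25), (0.0, 0.0)],
]

item = (1.25, 0.75)
codes_a, residual_a = residual_quantize(item, CODEBOOKS_A)
codes_b, residual_b = residual_quantize(item, CODEBOOKS_B)
```

With these toy values the same item receives the identifier `[1, 1]` from the first checkpoint and `[1, 0]` from the second, which is the sense in which one interaction sequence can yield several token sequences.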

Authors:Ziqing Fan, Cheng Liang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
Abstract:
Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance in complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we present ChestX-Reasoner, a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports, reflecting the step-by-step reasoning followed by radiologists. We construct a large dataset by extracting and refining reasoning chains from routine radiology reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards. We introduce RadRBench-CXR, a comprehensive benchmark featuring 59K visual question answering samples with 301K clinically validated reasoning steps, and propose RadRScore, a metric evaluating reasoning factuality, completeness, and effectiveness. ChestX-Reasoner outperforms existing medical and general-domain MLLMs in both diagnostic accuracy and reasoning ability, achieving 16%, 5.9%, and 18% improvements in reasoning ability compared to the best medical MLLM, the best general MLLM, and its base model, respectively, as well as 3.3%, 24%, and 27% improvements in outcome accuracy. All resources are open-sourced to facilitate further research in medical reasoning MLLMs.
Chinese: ChestX-Reasoner是一种放射学诊断多模态大语言模型,通过利用临床报告中的过程监督和两阶段训练框架,在诊断准确性和推理能力上超越现有模型,所有资源均已开源。
English: ChestX-Reasoner is a radiology diagnosis multimodal LLM that uses process supervision from clinical reports and a two-stage training framework to outperform existing models in diagnostic accuracy and reasoning ability, with all resources open-sourced.

Authors:Jiaan Wang, Fandong Meng, Jie Zhou
Title: DeepTrans: Deep Reasoning Translation via Reinforcement Learning
Abstract:
Recently, deep reasoning LLMs (e.g., OpenAI o1 and DeepSeek-R1) have shown promising performance in various downstream tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation. However, the task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning (RL). Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought processes. The reward model teaches DeepTrans how to think and free-translate the given sentences during RL. Besides, our RL training does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning LLMs. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.
中文摘要:DeepTrans通过强化学习训练深度推理翻译模型,无需标注数据即可实现自由翻译,在文学翻译中显著超越现有模型。
English Summary: DeepTrans, a deep reasoning translation model using reinforcement learning, effectively learns free translation without labeled data and significantly outperforms existing models in literature translation.

Authors:Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
Title: VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Abstract:
Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.
中文: VisuLogic是一个旨在评估多模态模型真实视觉推理能力的新基准,结果显示当前模型表现不佳,准确率低于30%,远落后于人类水平。
English: VisuLogic is a new benchmark designed to evaluate genuine visual reasoning in multimodal models, revealing that current models perform poorly with accuracy below 30%, significantly lagging behind human capabilities.

Authors:Da Li, Keping Bi, Jiafeng Guo, Xueqi Cheng
Title: Bridging Queries and Tables through Entities in Table Retrieval
Abstract:
Table retrieval is essential for accessing information stored in structured tabular formats; however, it remains less explored than text retrieval. The content of the table primarily consists of phrases and words, which include a large number of entities, such as time, locations, persons, and organizations. Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval. In this work, we explore how to leverage entities in tables to improve retrieval performance. First, we investigate the important role of entities in table retrieval from a statistical perspective and propose an entity-enhanced training framework. Subsequently, we use the type of entities to highlight entities instead of introducing an external knowledge base. Moreover, we design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that our proposed framework is both simple and effective in enhancing existing retrievers. We also conduct extensive analyses to confirm the efficacy of different components. Overall, our work provides a promising direction for elevating table retrieval, enlightening future research in this area.
中文: 本研究提出了一种基于实体的增强训练框架,无需外部知识即可利用表格实体提升检索性能,实证结果和组件分析验证了其在基准测试中的有效性。
English: This study introduces an entity-enhanced training framework that leverages table entities without external knowledge to improve table retrieval performance, demonstrating effectiveness on benchmarks through empirical results and component analysis.

Authors:Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Title: Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation
Abstract:
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems, aiming to reduce dependence on costly human annotations. We address the gap between retrieval relevance and generative utility by employing LLMs to annotate document utility. To effectively utilize multiple positive samples per query, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics. Furthermore, combining LLM annotations with just 20% of human labels achieves performance comparable to using full human annotations. Our study offers a comprehensive approach to utilizing LLM annotations for initializing QA systems on new corpora.
中文: 本研究证明,利用大语言模型标注文档效用训练检索与检索增强生成系统,可显著提升跨域检索性能,且仅需结合20%人工标注即可达到全人工标注的同等效果。
English: This study demonstrates that using large language models to annotate document utility for training retrieval and retrieval-augmented generation systems significantly improves out-of-domain performance and achieves comparable results to full human annotations when combined with just 20% human labels.
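
One plausible reading of a loss that "maximizes the summed marginal likelihood" of several positives is the negative log of the total softmax probability mass assigned to all positive documents for a query. The sketch below follows that reading. The function name and the exact form are assumptions, and the paper's loss may differ in detail.

```python
import math
from typing import Sequence


def summed_likelihood_loss(scores: Sequence[float], positive_ids: Sequence[int]) -> float:
    """Negative log of the softmax probability mass placed on the positives."""
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    positive_mass = sum(exp_scores[i] for i in positive_ids)
    return -math.log(positive_mass / sum(exp_scores))


# Four candidate documents for one query; documents 0 and 1 are annotated as useful.
uniform_loss = summed_likelihood_loss([0.0, 0.0, 0.0, 0.0], [0, 1])
better_loss = summed_likelihood_loss([2.0, 2.0, 0.0, 0.0], [0, 1])
```

Unlike a per-positive cross-entropy, this form does not force every positive to outrank every other positive; it only rewards moving probability mass onto the positive set as a whole.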

Authors:Hengran Zhang, Keping Bi, Jiafeng Guo, Xiaojie Sun, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Title: Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling
Abstract:
Dense retrieval is a crucial task in Information Retrieval (IR), serving as the basis for downstream tasks such as re-ranking and augmenting generation. Recently, large language models (LLMs) have demonstrated impressive semantic understanding capabilities, making them attractive to researchers focusing on dense retrieval. While LLMs, as decoder-style generative models, excel in language generation, they often fall short in modeling global information due to a lack of attention to subsequent tokens. Drawing inspiration from the classical word-based language modeling approach for IR, specifically the query likelihood (QL) model, we aim to leverage the generative strengths of LLMs through QL maximization. Rather than employing QL estimation for document ranking, we propose an auxiliary task of QL maximization to enhance the backbone for subsequent contrastive learning of the retriever. We introduce our model, LLM-QL, which incorporates two key components: Attention Block (AB) and Document Corruption (DC). AB blocks the attention of predictive tokens to the document tokens before the document's ending token, while DC corrupts a document by masking a portion of its tokens during prediction. Evaluations on the in-domain (MS MARCO) and out-of-domain dataset (BEIR) indicate LLM-QL's superiority over other LLM-based retrievers. Furthermore, comprehensive analyses also validate the efficacy of LLM-QL and its components.
中文: 提出的LLM-QL模型通过结合查询似然最大化、注意力阻断和文档破坏技术来增强密集检索,在领域内和领域外数据集上均表现出优于其他基于大语言模型的检索器的性能。
English: The proposed LLM-QL model enhances dense retrieval by integrating query likelihood maximization with attention blocking and document corruption, demonstrating superior performance on both in-domain and out-of-domain datasets compared to other LLM-based retrievers.
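
The two components described in the abstract can be pictured with a toy attention mask and a toy masking routine. The sketch assumes a sequence laid out as document tokens, then one document-ending token, then the query tokens to be predicted. That layout, the function names, and the `[MASK]` placeholder are illustrative assumptions rather than details confirmed by the source.

```python
import random
from typing import List


def attention_block_mask(doc_len: int, query_len: int) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True when position i may attend to position j.

    Layout: document tokens at [0, doc_len), the ending token at doc_len,
    then query_len predictive (query) tokens.
    """
    end_pos = doc_len
    total = doc_len + 1 + query_len
    mask: List[List[bool]] = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i                         # standard decoder-style causality
            blocked = i > end_pos and j < end_pos   # query tokens cannot see raw document tokens
            row.append(causal and not blocked)
        mask.append(row)
    return mask


def corrupt_document(
    tokens: List[str], ratio: float, rng: random.Random, mask_token: str = "[MASK]"
) -> List[str]:
    """Replace a random fraction of document tokens with a mask token."""
    n_mask = int(len(tokens) * ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_positions else tok for i, tok in enumerate(tokens)]


mask = attention_block_mask(doc_len=3, query_len=2)
corrupted = corrupt_document(["dense", "retrieval", "with", "llms"], 0.5, random.Random(0))
```

Under this mask the query tokens reach the document only through the ending token, which pushes the model to compress document content into that single position.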

Authors:Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency
Abstract:
Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.
中文: PixelHacker提出了一种新颖的潜在类别引导范式,通过分别编码前景背景特征并在去噪过程中注入,有效解决了图像修复中的结构与语义难题,在多个数据集上全面超越现有最优方法。
English: PixelHacker introduces a novel latent categories guidance paradigm and a diffusion-based model that significantly outperforms state-of-the-art methods by addressing structural and semantic challenges in image inpainting through separate foreground/background embeddings and linear attention.

Authors:Yunze Deng, Haijun Xiong, Bin Feng, Xinggang Wang, Wenyu Liu
Title: STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting
Abstract:
Text-to-4D generation is rapidly developing and widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided texts. Therefore, we propose STP4D, a novel approach that aims to integrate comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, which collaborate to accomplish this goal. Furthermore, STP4D is among the first methods to exploit the Diffusion model to generate 4D Gaussians, combining the fine-grained modeling capabilities and the real-time rendering process of 4DGS with the rapid inference speed of the Diffusion model. Extensive experiments demonstrate that STP4D excels in generating high-fidelity 4D content with exceptional efficiency (approximately 4.6s per asset), surpassing existing methods in both quality and speed.
中文: STP4D通过三个精心设计的模块整合时空-提示一致性建模,率先结合扩散模型与4D高斯技术,仅需4.6秒即可生成高保真文本驱动4D内容,在质量与速度上均超越现有方法。
English: STP4D introduces a unified framework integrating spatio-temporal-prompt consistency modeling through three specialized modules, leveraging Diffusion models with 4D Gaussians to achieve high-fidelity text-to-4D generation in just 4.6 seconds per asset, outperforming existing methods in quality and efficiency.

Authors:Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma
Title: Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions
Abstract:
The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.
中文摘要:本综述分析了知识蒸馏与数据集蒸馏这两种互补方法,旨在压缩大型语言模型的同时保持其核心能力,通过整合策略解决多领域应用中的可扩展性与性能保障问题。
English Summary: This survey analyzes Knowledge Distillation and Dataset Distillation as complementary methods for compressing Large Language Models while maintaining their capabilities, exploring integrated approaches to address scalability and performance challenges across various domains.

Authors:Ziyu Liu, Lintao Tang, Zeliang Sun, Zhengliang Liu, Yanjun Lyu, Wei Ruan, Yangshuang Xu, Liang Shan, Jiyoon Shin, Xiaohe Chen, Dajiang Zhu, Tianming Liu, Rongjie Liu, Chao Huang
Title: AD-GPT: Large Language Models in Alzheimer's Disease
Abstract:
Large language models (LLMs) have emerged as powerful tools for medical information retrieval, yet their accuracy and depth remain limited in specialized domains such as Alzheimer's disease (AD), a growing global health challenge. To address this gap, we introduce AD-GPT, a domain-specific generative pre-trained transformer designed to enhance the retrieval and analysis of AD-related genetic and neurobiological information. AD-GPT integrates diverse biomedical data sources, including potential AD-associated genes, molecular genetic information, and key gene variants linked to brain regions. We develop a stacked LLM architecture combining Llama3 and BERT, optimized for four critical tasks in AD research: (1) genetic information retrieval, (2) gene-brain region relationship assessment, (3) gene-AD relationship analysis, and (4) brain region-AD relationship mapping. Comparative evaluations against state-of-the-art LLMs demonstrate AD-GPT's superior precision and reliability across these tasks, underscoring its potential as a robust and specialized AI tool for advancing AD research and biomarker discovery.
Chinese: AD-GPT是一种专业的人工智能模型,整合生物医学数据以提升阿尔茨海默病相关遗传和神经生物学信息的检索与分析能力,在关键研究任务中展现出优于现有大语言模型的精确性和可靠性。
English: AD-GPT is a specialized AI model that integrates biomedical data to enhance the retrieval and analysis of Alzheimer's disease-related genetic and neurobiological information, outperforming existing large language models in precision and reliability for key research tasks.

Authors:Antonio A. Ginart, Naveen Kodali, Jason Lee, Caiming Xiong, Silvio Savarese, John R. Emmons
Title: LZ Penalty: An information-theoretic repetition penalty for autoregressive language models
Abstract:
We introduce the LZ penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding with the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables state-of-the-art open-source reasoning models to operate with greedy (temperature zero) decoding without loss of capability and without instances of degenerate repetition. Both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4%.
中文: LZ惩罚机制能有效消除自回归语言模型在贪婪解码中的退化重复,同时保持模型性能,显著优于传统的频率和重复惩罚方法。
English: The LZ penalty effectively eliminates degenerate repetitions in autoregressive language models during greedy decoding without compromising performance, outperforming traditional frequency and repetition penalties.
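
To make the compression intuition concrete, the sketch below performs an LZ77-style longest-match search over a sliding window and lowers the logit of any token that would extend a long earlier match. This is a simplified heuristic for illustration only: the paper derives its penalty from LZ77 codelengths, whereas the linear `alpha * match_length` rule, the function names, and the integer token ids used here are assumptions.

```python
from typing import Dict, List, Sequence


def longest_suffix_matches(tokens: Sequence[int], window: int) -> Dict[int, int]:
    """Map each candidate next token to the longest earlier match it would extend.

    For every earlier position p inside the window, count how many tokens ending
    at p agree with the tokens ending at the current position. The token that
    followed p would continue that repeated pattern.
    """
    n = len(tokens)
    start = max(0, n - window)
    best: Dict[int, int] = {}
    for p in range(start, n - 1):
        length = 0
        while p - length >= start and tokens[p - length] == tokens[n - 1 - length]:
            length += 1
        if length > 0:
            follower = tokens[p + 1]
            best[follower] = max(best.get(follower, 0), length)
    return best


def apply_match_penalty(
    logits: List[float], tokens: Sequence[int], window: int, alpha: float
) -> List[float]:
    """Subtract alpha * match_length from tokens that would extend a repetition."""
    matches = longest_suffix_matches(tokens, window)
    return [logit - alpha * matches.get(tok, 0) for tok, logit in enumerate(logits)]


history = [1, 2, 3, 1, 2, 3, 1, 2]    # the pattern "1 2 3" is looping
raw_logits = [0.0, 0.0, 0.0, 1.0]     # greedy decoding would emit token 3 again
penalized = apply_match_penalty(raw_logits, history, window=16, alpha=0.5)
```

Here token 3 would extend a five-token repeat, so its logit drops from 1.0 to -1.5 and greedy decoding no longer continues the loop.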

Authors:Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
Title: DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Abstract:
We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
中文: DyMU是一种无需训练的框架,通过根据图像复杂度动态合并视觉标记并重建注意力动态,显著降低视觉语言模型的计算负担(减少32%-85%的标记数量),同时在不同任务中保持性能。
English: DyMU is a training-free framework that dynamically reduces the computational cost of vision-language models by merging visual tokens based on image complexity and reconstructing attention dynamics, achieving significant token reduction (32%-85%) while maintaining performance across various tasks.
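
The central idea of content-dependent token merging can be shown with a toy routine that folds each token into the preceding group when the two are similar enough, so uniform images collapse to few tokens while varied images keep most of theirs. This greedy, adjacent-only scheme is a simplification chosen for brevity. It is not the DToMe procedure from the paper, and the threshold and the 2-D "embeddings" are made-up values.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_similar_tokens(tokens: List[Vector], threshold: float) -> List[Vector]:
    """Greedily merge each token into the previous group if it resembles the group's mean."""
    groups: List[List[Vector]] = []
    for tok in tokens:
        if groups and cosine(mean(groups[-1]), tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return [mean(group) for group in groups]


simple_image = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
complex_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
simple_out = merge_similar_tokens(simple_image, threshold=0.9)
complex_out = merge_similar_tokens(complex_image, threshold=0.9)
```

The same threshold yields two tokens for the uniform input and four for the varied one, so the output length tracks image complexity rather than being fixed in advance.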

Authors:Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, Weinan Zhang
Title: A Survey of AI Agent Protocols
Abstract:
The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and trigger the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two-dimensional classification that differentiates context-oriented versus inter-agent protocols and general-purpose versus domain-specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next-generation protocols. These characteristics include adaptability, privacy preservation, and group-based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.
中文: 大语言模型代理在各行业的广泛应用因缺乏标准化通信协议而受限,本文系统分析并分类了现有协议,指出下一代协议的关键研究方向,强调适应性、隐私保护和群体智能等特性。
English: The rapid deployment of LLM agents across industries is hindered by the lack of standardized communication protocols; this survey analyzes and classifies existing protocols and identifies research directions for next-generation protocols, emphasizing adaptability, privacy preservation, and collective intelligence.

Authors:Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty
Title: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Abstract:
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM conditions on; and (2) Output level, which covers methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. ...
中文: 本综述从推理阶段和系统架构两个正交维度对大语言模型的推理方法进行分类,通过输入输出层面分析技术路径,并重点探讨了从推理扩展转向学习推理、智能体工作流等新兴趋势。
English: This survey systematically categorizes reasoning methods in large language models along regimes and architectures, analyzing input and output level techniques while highlighting trends like learning-to-reason and agentic workflows.

Authors:Junjie Zhang, Rushuai Yang, Shunyu Liu, Ting-En Lin, Fei Huang, Yi Chen, Yongbin Li, Dacheng Tao
Title: Supervised Optimism Correction: Be Confident When LLMs Are Sure
Abstract:
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction (SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
中文: 本研究揭示了监督微调在大语言模型中隐式学习Q函数,暴露了集束搜索的过度乐观问题,并提出监督乐观校正方法通过价值正则化抑制该现象,在数学推理基准测试中取得了优异表现。
English: This study reveals that supervised fine-tuning in large language models implicitly learns a Q-function, exposing beam search's over-optimism issue, and introduces Supervised Optimism Correction to suppress this through value regularization, achieving superior results on reasoning benchmarks.

Authors:Lingyue Fu, Ting Long, Jianghao Lin, Wei Xia, Xinyi Dai, Ruiming Tang, Yasheng Wang, Weinan Zhang, Yong Yu
Title: AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing
Abstract:
Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to discrepancies with the multi-step inference process required in real-world simulations, resulting in significant error accumulation. This accumulation of error, coupled with the issue of data sparsity, can substantially degrade the performance of recommendation models in intelligent tutoring systems. To address these challenges, we propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT), which, for the first time, focuses on the multi-step KT task. More specifically, AdvKT leverages an adversarial learning paradigm involving a generator and a discriminator. The generator mimics high-reward responses, effectively reducing error accumulation across multiple steps, while the discriminator provides feedback to generate synthetic data. Additionally, we design specialized data augmentation techniques to enrich the training data with realistic variations, ensuring that the model generalizes well even in scenarios with sparse data. Experiments conducted on four real-world datasets demonstrate the superiority of AdvKT over existing KT models, showcasing its ability to address both error accumulation and data sparsity issues effectively.
中文: 提出的AdvKT框架通过对抗性学习和数据增强技术,有效解决了知识追踪中的误差累积和数据稀疏问题,在多个真实数据集上展现出卓越性能。
English: The proposed AdvKT framework employs adversarial learning and data augmentation to mitigate error accumulation and data sparsity in knowledge tracing, demonstrating superior performance across multiple datasets.

Authors:Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
Title: Entropy-Based Block Pruning for Efficient Large Language Models
Abstract:
As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
中文: 本研究提出一种基于信息熵的Transformer模型剪枝方法,通过利用熵作为比余弦相似度更有效的信息丰富度衡量指标,在保持精度的同时更有效地减小模型规模。
English: This study introduces an entropy-based pruning method for Transformer models, which more effectively reduces model size while maintaining accuracy by leveraging entropy as a superior measure of information richness compared to cosine similarity.
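
A minimal sketch of scoring blocks by entropy follows. The abstract does not give the entropy estimator or the exact selection rule, so two things here are assumptions: entropy is estimated with a simple histogram over a block's flattened output values, and the blocks with the lowest estimated entropy are treated as pruning candidates. All function names are hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(values, bins=8):
    """Shannon entropy (in bits) of a histogram over `values`.

    Assumption: a fixed-width histogram is used as the estimator; the
    paper's estimator may differ.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # A constant signal carries no information.
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def blocks_to_prune(block_outputs, k):
    """Indices of the k blocks with the lowest estimated entropy.

    Assumption: low-entropy blocks are pruned first.
    """
    scores = [histogram_entropy(out) for out in block_outputs]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


# Toy "hidden states" for three blocks, flattened to lists of floats.
outputs = [
    [0.0] * 8,                                   # constant output
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],    # spread over all bins
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],    # two distinct values
]
candidates = blocks_to_prune(outputs, k=2)
```

Unlike a cosine-similarity criterion, which compares the direction of a block's input and output, this score depends only on how the output values are distributed.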

Authors:Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong
Title: APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Abstract:
Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $τ$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; Dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k and Website at https://apigen-mt.github.io
中文: APIGen-MT通过任务蓝图和模拟交互的两阶段框架生成可验证的多样化多轮智能体数据,其训练的xLAM-2-fc-r模型在多轮基准测试中超越GPT-4o和Claude 3.5,同时保持更优的一致性表现。
English: APIGen-MT is a two-phase framework that generates verifiable, diverse multi-turn agent data through task blueprints and simulated interactions, enabling the training of xLAM-2-fc-r models that outperform GPT-4o and Claude 3.5 in multi-turn benchmarks while maintaining superior consistency.

Authors:Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng
Title: Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models
Abstract:
Retrieval-Augmented Language Models (RALMs) boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors, multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.
中文: SCARLet框架通过多任务泛化和段落间交互训练检索增强语言模型中的效用检索器,有效提升了各类任务的整体性能。
English: SCARLet is a framework that trains utility-based retrievers in Retrieval-Augmented Language Models by incorporating multi-task generalization and inter-passage interaction to enhance overall performance across various tasks.
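
Perturbation-based attribution can be sketched with a leave-one-out scheme: remove one passage from the shared context and measure how much the downstream score drops. This is one simple instance of the idea, not necessarily the attribution method SCARLet uses. The scorer `toy_score` is a hypothetical stand-in for a language model's task metric.

```python
def leave_one_out_utility(passages, score_fn):
    """Estimate each passage's utility as the score drop when it is removed.

    `score_fn` maps a list of passages (the context) to a scalar task
    score. Because the other passages stay in the context, the estimate
    reflects interactions between passages.
    """
    full = score_fn(passages)
    utilities = []
    for i in range(len(passages)):
        ablated = passages[:i] + passages[i + 1:]
        utilities.append(full - score_fn(ablated))
    return utilities


def toy_score(context):
    """Hypothetical scorer: fraction of required facts found in the context."""
    required = ["paris", "1889"]
    text = " ".join(context).lower()
    return sum(fact in text for fact in required) / len(required)


passages = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "Bananas are yellow.",
]
utilities = leave_one_out_utility(passages, toy_score)
```

In this toy case the third passage is on-topic for no required fact, so removing it changes nothing and its estimated utility is zero, while each of the first two passages accounts for half of the score.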

Authors:Yixian Wang, Geng Sun, Zemin Sun, Jiacheng Wang, Jiahui Li, Changyuan Zhao, Jing Wu, Shuang Liang, Minghao Yin, Pengfei Wang, Dusit Niyato, Sumei Sun, Dong In Kim
Title: Toward Realization of Low-Altitude Economy Networks: Core Architecture, Integrated Technologies, and Future Directions
Abstract:
The rise of the low-altitude economy (LAE) is propelling urban development and emerging industries by integrating advanced technologies to enhance efficiency, safety, and sustainability in low-altitude operations. The widespread adoption of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft plays a crucial role in enabling key applications within LAE, such as urban logistics, emergency rescue, and aerial mobility. However, unlike traditional UAV networks, LAE networks encounter increased airspace management demands due to dense flying nodes and potential interference with ground communication systems. In addition, there are heightened and extended security risks in real-time operations, particularly the vulnerability of low-altitude aircraft to cyberattacks from ground-based threats. To address these challenges, this paper first explores related standards and the core architecture that support the development of LAE networks. Subsequently, we highlight the integration of technologies such as communication, sensing, computing, positioning, navigation, surveillance, flight control, and airspace management. This multi-technology synergy drives the advancement of real-world LAE applications, particularly in improving operational efficiency, optimizing airspace usage, and ensuring safety. Finally, we outline future research directions for LAE networks, such as intelligent and adaptive optimization, security and privacy protection, sustainable energy and power management, quantum-driven coordination, generative governance, and three-dimensional (3D) airspace coverage, which collectively underscore the potential of collaborative technologies to advance LAE networks.
中文: 低空经济通过无人机等技术推动城市发展,但面临空域管理和安全风险等挑战;本文探讨了多技术融合解决方案,并规划了智能化、安全保护等未来研究方向。
English: The low-altitude economy (LAE) is advancing urban development through technologies like UAVs and eVTOLs, but faces challenges in airspace management and cybersecurity, which this paper addresses by exploring integrated solutions and outlining future research directions.

Authors:Ruichen Zhang, Yinqiu Liu, Shunpu Tang, Jiacheng Wang, Dusit Niyato, Geng Sun, Yonghui Li, Sumei Sun
Title: Covert Prompt Transmission for Secure Large Language Model Services
Abstract:
This paper investigates covert prompt transmission for secure and efficient large language model (LLM) services over wireless networks. We formulate a latency minimization problem under fidelity and detectability constraints to ensure confidential and covert communication by jointly optimizing the transmit power and prompt compression ratio. To solve this problem, we first propose a prompt compression and encryption (PCAE) framework, performing surprisal-guided compression followed by lightweight permutation-based encryption. Specifically, PCAE employs a locally deployed small language model (SLM) to estimate token-level surprisal scores, selectively retaining semantically critical tokens while discarding redundant ones. This significantly reduces computational overhead and transmission duration. To further enhance covert wireless transmission, we then develop a group-based proximal policy optimization (GPPO) method that samples multiple candidate actions for each state, selecting the optimal one within each group and incorporating a Kullback-Leibler (KL) divergence penalty to improve policy stability and exploration. Simulation results show that PCAE achieves comparable LLM response fidelity to baseline methods while reducing preprocessing latency by over five orders of magnitude, enabling real-time edge deployment. We further validate PCAE effectiveness across diverse LLM backbones, including DeepSeek-32B, Qwen-32B, and their smaller variants. Moreover, GPPO reduces covert transmission latency by up to 38.6\% compared to existing reinforcement learning strategies, with further analysis showing that increased transmit power provides additional latency benefits.
中文: 本文提出了一种提示压缩加密框架和分组近端策略优化方法,在保证大型语言模型服务响应质量和安全性的同时,显著降低了无线隐蔽传输的延迟。
English: This paper introduces a prompt compression and encryption (PCAE) framework combined with a group-based proximal policy optimization (GPPO) method to minimize latency in covert wireless transmission of large language model services while maintaining response fidelity and security.
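
The compress-then-permute pipeline can be sketched as follows. Several parts are assumptions made for illustration: a smoothed unigram count table stands in for the small language model that estimates token-level surprisal, `keep_ratio` plays the role of the compression ratio, and a seeded shuffle stands in for the permutation-based encryption (Python's `random` module is not cryptographically secure). All names are hypothetical.

```python
import math
import random
from collections import Counter


def surprisal_scores(tokens, reference_counts):
    """Surprisal -log2 p(token) under an add-one-smoothed unigram model."""
    total = sum(reference_counts.values())
    vocab = len(reference_counts) + 1  # one extra slot for unseen tokens
    return [
        -math.log2((reference_counts.get(t, 0) + 1) / (total + vocab))
        for t in tokens
    ]


def compress(tokens, scores, keep_ratio):
    """Keep the highest-surprisal tokens, preserving their original order."""
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def permute(tokens, key):
    """Reorder tokens with a key-seeded permutation (toy cipher)."""
    order = list(range(len(tokens)))
    random.Random(key).shuffle(order)
    return [tokens[i] for i in order]


def unpermute(tokens, key):
    """Invert `permute` by regenerating the same permutation from the key."""
    order = list(range(len(tokens)))
    random.Random(key).shuffle(order)
    restored = [None] * len(tokens)
    for pos, src in enumerate(order):
        restored[src] = tokens[pos]
    return restored


reference = Counter({"the": 50, "of": 30, "a": 20})
prompt = "summarize the history of quantum computing".split()
scores = surprisal_scores(prompt, reference)
compressed = compress(prompt, scores, keep_ratio=0.67)
recovered = unpermute(permute(compressed, key=1234), key=1234)
```

Frequent function words receive low surprisal and are dropped first, so the transmitted prompt is shorter while the rarer, content-bearing tokens survive.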

Authors:Chuang Zhang, Geng Sun, Jiahui Li, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Shiwen Mao, Tony Q. S. Quek
Title: Aerial Active STAR-RIS-assisted Satellite-Terrestrial Covert Communications
Abstract:
The integration of satellite and terrestrial networks is crucial for enhancing the performance of next-generation communication systems. However, the networks are hindered by the long-distance path loss and security risks in dense urban environments. In this work, we propose a satellite-terrestrial covert communication system assisted by the aerial active simultaneous transmitting and reflecting reconfigurable intelligent surface (AASTAR-RIS) to improve the channel capacity while ensuring the transmission covertness. Specifically, we first derive the minimal detection error probability (DEP) under the worst condition that the Warden has perfect channel state information (CSI). Then, we formulate an AASTAR-RIS-assisted satellite-terrestrial covert communication optimization problem (ASCCOP) to maximize the sum of the fair channel capacity for all ground users while meeting the strict covert constraint, by jointly optimizing the trajectory and active beamforming of the AASTAR-RIS. Due to the challenges posed by the complex and high-dimensional state-action spaces as well as the need for efficient exploration in dynamic environments, we propose a generative deterministic policy gradient (GDPG) algorithm, which is a generative deep reinforcement learning (DRL) method to solve the ASCCOP. Concretely, the generative diffusion model (GDM) is utilized as the policy representation of the algorithm to enhance the exploration process by generating diverse and high-quality samples through a series of denoising steps. Moreover, we incorporate an action gradient mechanism to accomplish the policy improvement of the algorithm, which refines the better state-action pairs through the gradient ascent. Simulation results demonstrate that the proposed approach significantly outperforms important benchmarks.
中文: 本文提出了一种空中主动式同步收发可重构智能表面辅助的星地隐蔽通信系统,通过创新的生成式深度强化学习算法在保证传输隐蔽性的同时提升信道容量,显著优于现有基准方法。
English: This paper proposes a satellite-terrestrial covert communication system assisted by an aerial active simultaneous transmitting and reflecting reconfigurable intelligent surface (AASTAR-RIS) to enhance channel capacity while ensuring transmission covertness, using a novel generative deep reinforcement learning algorithm that significantly outperforms existing benchmarks.

Authors:Geng Sun, Jia Qi, Chuang Zhang, Xuejie Liu, Jiacheng Wang, Dusit Niyato, Yuanwei Liu, Dong In Kim
Title: Generative Artificial Intelligence for Beamforming in Low-Altitude Economy
Abstract:
The growth of the low-altitude economy (LAE) has driven a rising demand for efficient and secure communication. However, conventional beamforming optimization techniques struggle in complex LAE environments. In this context, generative artificial intelligence (GenAI) methods provide a promising solution. In this article, we first introduce the core concepts of LAE and the roles of beamforming in advanced communication technologies for LAE. We then examine their interrelation, followed by an analysis of the limitations of conventional beamforming methods. Next, we provide an overview of how GenAI methods enhance the process of beamforming, with a focus on their applications in LAE. Furthermore, we present a case study using a generative diffusion model (GDM)-based algorithm to enhance the performance of aerial collaborative beamforming-enabled remote secure communications in LAE. Simulation results verify the effectiveness of the proposed algorithm. Finally, promising research opportunities are identified.
中文: 低空经济的发展对高效通信提出需求,而生成式人工智能为优化波束成形提供了有效解决方案,通过基于生成扩散模型的案例研究验证了其在安全通信中的性能提升。
English: The growth of the low-altitude economy demands efficient communication, and generative AI offers a promising solution to enhance beamforming, as demonstrated by a case study using a generative diffusion model for secure aerial communications.

Authors:Xin Tang, Qian Chen, Wenjie Weng, Chao Jin, Zhang Liu, Jiacheng Wang, Geng Sun, Xiaohuan Li, Dusit Niyato
Title: Task Assignment and Exploration Optimization for Low Altitude UAV Rescue via Generative AI Enhanced Multi-agent Reinforcement Learning
Abstract:
The integration of emerging uncrewed aerial vehicles (UAVs) with artificial intelligence (AI) and ground-embedded robots (GERs) has transformed emergency rescue operations in unknown environments. However, the high computational demands often exceed a single UAV's capacity, making it difficult to continuously provide stable high-level services. To address this, this paper proposes a cooperation framework involving UAVs, GERs, and airships. The framework enables resource pooling through UAV-to-GER (U2G) and UAV-to-airship (U2A) links, offering computing services for offloaded tasks. Specifically, we formulate the multi-objective problem of task assignment and exploration as a dynamic long-term optimization problem aiming to minimize task completion time and energy use while ensuring stability. Using Lyapunov optimization, we transform it into a per-slot deterministic problem and propose HG-MADDPG, which combines the Hungarian algorithm with a GDM-based multi-agent deep deterministic policy gradient. Simulations demonstrate significant improvements in offloading efficiency, latency, and system stability over baselines.
中文: 本文提出了一种无人机、地面机器人和飞艇协同框架,通过HG-MADDPG算法优化任务分配与探索,有效解决了应急救援中计算资源不足的问题,显著提升了系统效率和稳定性。
English: This paper introduces a cooperative framework integrating UAVs, ground-embedded robots, and airships to address computational limitations in emergency rescue operations by optimizing task assignment and exploration through a novel HG-MADDPG algorithm, significantly enhancing efficiency and stability.
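
The task-assignment step is an instance of the minimum-cost assignment problem, which the Hungarian algorithm solves in polynomial time. The sketch below solves the same problem by brute force over all permutations, as a plainly named stand-in that is only practical for very small instances. The cost matrix is hypothetical, and the sketch does not cover the Lyapunov transformation or the MADDPG component.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Exhaustively find the one-to-one assignment with minimum total cost.

    cost[i][j] is the cost of giving task j to agent i (square matrix).
    This brute-force search is a stand-in for the Hungarian algorithm and
    returns the same optimal cost, but in factorial time.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Hypothetical per-slot costs (for example, weighted delay and energy)
# for three UAVs (rows) and three tasks (columns).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total_cost = min_cost_assignment(cost)
```

Here `assignment[i]` is the task index given to UAV `i`. For realistic problem sizes a polynomial-time solver would replace the exhaustive loop.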

Authors:Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai
Title: FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment
Abstract:
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face videos, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
中文: 本研究提出了首个大规模人脸视频质量评估数据集FVQ-20K,并开发了FVQ-Rater方法,通过融合多模态特征和指令微调技术实现类人化质量评分,为推进人脸视频质量评估领域发展展现出重要潜力。
English: This study introduces FVQ-20K, the first large-scale dataset for face video quality assessment (FVQA), and proposes FVQ-Rater, a novel method leveraging multimodal features and instruction tuning to achieve human-like quality evaluation, demonstrating significant potential for advancing FVQA research.

Authors:Lingyi Cai, Jiacheng Wang, Ruichen Zhang, Yu Zhang, Tao Jiang, Dusit Niyato, Xianbin Wang, Abbas Jamalipour, Xuemin Shen
Title: Secure Physical Layer Communications for Low-Altitude Economy Networking: A Survey
Abstract:
The Low-Altitude Economy Networking (LAENet) is emerging as a transformative paradigm that enables an integrated and sophisticated communication infrastructure to support aerial vehicles in carrying out a wide range of economic activities within low-altitude airspace. However, the physical layer communications in the LAENet face growing security threats due to inherent characteristics of aerial communication environments, such as signal broadcast nature and channel openness. These challenges highlight the urgent need for safeguarding communication confidentiality, availability, and integrity. In view of the above, this survey comprehensively reviews existing secure countermeasures for physical layer communication in the LAENet. We explore core methods focusing on anti-eavesdropping and authentication for ensuring communication confidentiality. Subsequently, availability-enhancing techniques are thoroughly discussed for anti-jamming and spoofing defense. Then, we review approaches for safeguarding integrity through anomaly detection and injection protection. Furthermore, we discuss future research directions, emphasizing energy-efficient physical layer security, multi-drone collaboration for secure communication, AI-driven security defense strategy, space-air-ground integrated security architecture, and 6G-enabled secure UAV communication. This survey may provide valuable references and new insights for researchers in the field of secure physical layer communication for the LAENet.
中文: 低空经济网络(LAENet)作为支持低空经济活动的变革性通信基础设施,其物理层通信面临严重安全威胁,本文系统综述了现有防护措施并展望了未来研究方向。
English: LAENet is a transformative communication infrastructure for low-altitude economic activities but faces significant physical layer security threats, prompting this comprehensive survey of existing countermeasures and future research directions.

Authors:Zhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren
Title: Replication and Exploration of Generative Retrieval over Dynamic Corpora
Abstract:
Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated using a static document collection, and their performance in dynamic corpora where document collections evolve continuously is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with \textit{text-based} docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with \textit{numeric-based} docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to their excessive bias toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) Semantic alignment with language models' pretrained knowledge, (ii) Fine-grained docid design, and (iii) High lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance on dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.
中文: 生成式检索模型采用基于文本的文档标识符在动态语料库中表现出卓越的泛化能力,其细粒度设计甚至超越传统方法;而数值型标识符因过度拟合训练集导致性能下降,需通过新型混合标识符设计兼顾效率与效果。
English: Generative retrieval models with text-based document identifiers demonstrate superior adaptability to evolving document collections, outperforming traditional methods and matching dense retrieval performance, while numeric-based identifiers suffer from overfitting and require a novel hybrid approach for efficiency and effectiveness.

Authors:Erxue Min, Hsiu-Yuan Huang, Xihong Yang, Min Yang, Xin Jia, Yunfang Wu, Hengyi Cai, Junfeng Wang, Shuaiqiang Wang, Dawei Yin
Title: From Prompting to Alignment: A Generative Framework for Query Recommendation
Abstract:
In modern search systems, search engines often suggest relevant queries to users through various panels or components, helping refine their information needs. Traditionally, these recommendations rely heavily on historical search logs to build models, which suffer from cold-start or long-tail issues. Furthermore, tasks such as query suggestion, completion, or clarification are studied separately with task-specific designs, which lack generalizability and hinder adaptation to novel applications. Despite recent attempts to explore the use of LLMs for query recommendation, these methods mainly rely on the inherent knowledge of LLMs or external sources like few-shot examples, retrieved documents, or knowledge bases, neglecting the importance of calibration and alignment with user feedback, thus limiting their practical utility. To address these challenges, we first propose a general Generative Query Recommendation (GQR) framework that aligns LLM-based query generation with user preference. Specifically, we unify diverse query recommendation tasks by a universal prompt framework, leveraging the instruction-following capability of LLMs for effective generation. Secondly, we align LLMs with user feedback by presenting a CTR-alignment framework, which involves training a query-wise CTR predictor as a process reward model and employing list-wise preference alignment to maximize the click probability of the generated query list. Furthermore, recognizing the inconsistency between LLM knowledge and proactive search intents arising from the separation of user-initiated queries from models, we align LLMs with user initiative by retrieving co-occurrence queries as side information when historical logs are available.
中文摘要:本文提出生成式查询推荐(GQR)框架,通过通用提示结构统一各类查询任务,并采用CTR对齐和共现查询检索使大语言模型与用户偏好及主动搜索意图保持一致,从而解决传统方法的冷启动与长尾问题。
English Summary: This paper introduces a Generative Query Recommendation (GQR) framework that unifies various query tasks through a universal prompt system and aligns LLMs with user preferences using CTR-based feedback and co-occurrence query retrieval to overcome traditional limitations.
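
The abstract states that list-wise alignment maximizes the click probability of the generated query list, but it does not give the formula. As a minimal sketch only, assuming (hypothetically) that each query's click is independent and that a per-query CTR predictor returns a probability, the chance that a list receives at least one click could be computed as below. The function name and the independence assumption are illustrative and are not taken from the paper.

```python
from typing import Sequence


def list_click_probability(query_ctrs: Sequence[float]) -> float:
    """Probability that at least one query in the list is clicked.

    Assumes independent clicks (an illustrative simplification, not
    necessarily the objective used by GQR).
    """
    no_click = 1.0
    for p in query_ctrs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each CTR must lie in [0, 1]")
        no_click *= 1.0 - p  # probability that this query is also skipped
    return 1.0 - no_click


if __name__ == "__main__":
    # Two candidate lists scored by a hypothetical per-query CTR predictor.
    print(list_click_probability([0.5, 0.5]))
    print(list_click_probability([0.1, 0.2, 0.3]))
```

Under this simplification, a preference pair for alignment could be formed by ranking candidate lists by this score; how GQR actually builds its preference data is not specified in the abstract.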

Authors:Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, Hao Liu
Title: TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning
Abstract:
Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grain annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs' intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violation compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.
中文摘要:本文提出了首个面向检索增强时空感知旅行规划的基准TP-RAG,并开发了EvoRAG进化框架,通过融合检索轨迹与大语言模型的内在推理,显著提升了旅行规划的时空合理性与常识合规性。
English Summary: This paper introduces TP-RAG, the first benchmark for retrieval-augmented travel planning that addresses spatiotemporal rationality, and proposes EvoRAG, an evolutionary framework that synergizes retrieved trajectories with LLMs to achieve state-of-the-art performance in travel planning.

Authors:Haoran Yan, Yinfang Chen, Minghua Ma, Ming Wen, Shan Lu, Shenglin Zhang, Tianyin Xu, Rujia Wang, Chetan Bansal, Saravan Rajmohan, Qingwei Lin, Chaoyun Zhang, Dongmei Zhang
Title: An Empirical Study of Production Incidents in Generative AI Cloud Services
Abstract:
The ever-increasing demand for generative artificial intelligence (GenAI) has motivated cloud-based GenAI services such as Azure OpenAI Service and Amazon Bedrock. Like any large-scale cloud service, failures are inevitable in cloud-based GenAI services, resulting in user dissatisfaction and significant monetary losses. However, GenAI cloud services, characterized by their massive parameter scales, hardware demands, and usage patterns, present unique challenges, including generated content quality issues and privacy concerns, compared to traditional cloud services. To understand the production reliability of GenAI cloud services, we analyzed production incidents from a leading GenAI cloud service provider spanning the past four years. Our study (1) presents the general characteristics of GenAI cloud service incidents at different stages of the incident life cycle; (2) identifies the symptoms and impacts of these incidents on GenAI cloud service quality and availability; (3) uncovers why these incidents occurred and how they were resolved; (4) discusses open research challenges in terms of incident detection, triage, and mitigation, and sheds light on potential solutions.
Chinese: 随着生成式AI需求的增长,Azure OpenAI和亚马逊Bedrock等云服务面临内容质量和隐私等独特可靠性挑战,一项为期四年的生产事件研究分析了其原因、影响及解决方案。
English: The growing demand for generative AI has led to cloud services like Azure OpenAI and Amazon Bedrock, which face unique reliability challenges such as content quality and privacy issues, prompting a four-year study of incidents to analyze causes, impacts, and solutions.

Authors:Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Title: ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%. Our project page is at https://opencausalab.github.io/ARise.
Chinese: ARise是一种新颖框架,通过将风险评估和动态检索增强生成结合到蒙特卡洛树搜索中,显著提升了大语言模型的推理能力,其性能比现有最优方法高出最多25.37%。
English: ARise is a novel framework that enhances reasoning in large language models by integrating risk assessment and dynamic retrieval-augmented generation within a Monte Carlo tree search, significantly outperforming existing methods by up to 25.37%.
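
The explore-exploit tradeoff mentioned in the abstract is commonly handled in Monte Carlo tree search with the standard UCT selection rule. The sketch below shows that generic rule only; it is not ARise's risk-adaptive scoring, which the abstract does not specify. The dictionary fields `value_sum` and `visits` and the constant `c` are illustrative assumptions.

```python
import math
from typing import Dict, List


def uct_score(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited branches first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Dict[str, float]], parent_visits: int, c: float = 1.4) -> int:
    """Return the index of the child reasoning branch with the highest UCT score."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i]["value_sum"], int(children[i]["visits"]), parent_visits, c),
    )


if __name__ == "__main__":
    # Hypothetical branches: value_sum could accumulate, for example,
    # (1 - estimated risk) of the reasoning states reached through that branch.
    branches = [
        {"value_sum": 3.0, "visits": 4},
        {"value_sum": 1.0, "visits": 4},
    ]
    print(select_child(branches, parent_visits=8))
```

A larger `c` favors rarely visited branches, while `c = 0` reduces the rule to greedy selection by mean value.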

Authors:Ruoxi Xu, Yunjie Ji, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Yingfei Sun, Xiangang Li, Le Sun
Title: Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning
Abstract:
Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.
中文摘要:本文提出了一个四层知识注入框架和DeepKnowledge测试平台,系统评估大语言模型对新知识、增量知识和更新知识的吸收深度,揭示了实现不同知识注入层次的关键因素与适配方法。
English Summary: This paper introduces a four-tier knowledge injection framework and the DeepKnowledge testbed to systematically evaluate how deeply large language models can absorb and utilize novel, incremental, and updated knowledge, identifying key factors and methods for achieving different levels of injection.

Authors:Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma
Title: Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection
Abstract:
The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at https://realiad4ad.github.io/Real-IAD D3
中文: Real-IAD D3数据集通过整合RGB图像、三维点云和伪三维深度信息,提出了一种高精度多模态工业异常检测方法,显著提升了20个类别中的检测鲁棒性和性能。
English: The Real-IAD D3 dataset introduces a high-precision multimodal approach for industrial anomaly detection, combining RGB, 3D point clouds, and pseudo-3D depth to enhance detection robustness and performance across 20 categories.

Authors:Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu
Title: Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning
Abstract:
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
Chinese: Embodied-R是一个协作框架,通过结合大规模视觉语言模型进行感知和小规模语言模型进行推理,在强化学习与新颖奖励机制下实现了具身空间推理任务的最先进性能。
English: Embodied-R is a collaborative framework that integrates large-scale Vision-Language Models for perception with small-scale Language Models for reasoning, achieving state-of-the-art performance in embodied spatial reasoning tasks through reinforcement learning with a novel reward system.

Authors:Jie Feng, Jinwei Zeng, Qingyue Long, Hongyi Chen, Jie Zhao, Yanxin Xi, Zhilun Zhou, Yuan Yuan, Shengyuan Wang, Qingbin Zeng, Songwei Li, Yunke Zhang, Yuming Lin, Tong Li, Jingtao Ding, Chen Gao, Fengli Xu, Yong Li
Title: A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science
Abstract:
Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.
中文: 本文探讨了空间智能在导航和地球科学等领域的跨学科特性,分析了大型语言模型中空间认知、记忆和推理如何在不同尺度上相互关联并发挥作用。
English: This paper explores the interdisciplinary nature of spatial intelligence across fields like navigation and earth science, analyzing how spatial cognition, memory, and reasoning in large language models connect and function across different scales.

Authors:Haotian Xu, Yue Hu, Chen Gao, Zhengqiu Zhu, Yong Zhao, Yong Li, Quanjun Yin
Title: GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation
Abstract:
Language-goal aerial navigation is a critical challenge in embodied AI, requiring UAVs to localize targets in complex environments such as urban blocks based on textual specification. Existing methods, often adapted from indoor navigation, struggle to scale due to limited field of view, semantic ambiguity among objects, and lack of structured spatial reasoning. In this work, we propose GeoNav, a geospatially aware multimodal agent to enable long-range navigation. GeoNav operates in three phases (landmark navigation, target search, and precise localization), mimicking human coarse-to-fine spatial strategies. To support such reasoning, it dynamically builds two different types of spatial memory. The first is a global but schematic cognitive map, which fuses prior textual geographic knowledge and embodied visual cues into a top-down, annotated form for fast navigation to the landmark region. The second is a local but delicate scene graph representing hierarchical spatial relationships between blocks, landmarks, and objects, which is used for definite target localization. On top of this structured representation, GeoNav employs a spatially aware, multimodal chain-of-thought prompting mechanism to equip multimodal large language models with efficient and interpretable decision-making across stages. On the CityNav urban navigation benchmark, GeoNav surpasses the current state-of-the-art by up to 12.53% in success rate and significantly improves navigation efficiency, even in hard-level tasks. Ablation studies highlight the importance of each module, showcasing how geospatial representations and coarse-to-fine reasoning enhance UAV navigation.
中文: 该摘要介绍了GeoNav,一种具备地理空间感知能力的多模态智能体,通过采用三阶段由粗到精的导航策略和双重空间记忆系统,在城市导航基准测试中实现了最先进的性能表现。
English: This abstract introduces GeoNav, a geospatially aware multimodal agent that enhances UAV navigation by employing a three-phase coarse-to-fine strategy and dual spatial memory systems, achieving state-of-the-art performance on urban navigation benchmarks.

Authors:Nicholas Sukiennik, Chen Gao, Fengli Xu, Yong Li
Title: An Evaluation of Cultural Value Alignment in LLM
Abstract:
LLMs as intelligent agents are being increasingly applied in scenarios where human interactions are involved, leading to a critical concern about whether LLMs are faithful to the variations in culture across regions. Several works have investigated this question in various ways, finding that there are biases present in the cultural representations of LLM outputs. To gain a more comprehensive view, in this work, we conduct the first large-scale evaluation of LLM culture, assessing 20 countries' cultures and languages across ten LLMs. With a renowned cultural values questionnaire and by carefully comparing LLM output with human ground truth scores, we thoroughly study LLMs' cultural alignment across countries and among individual models. Our findings show that the output over all models represents a moderate cultural middle ground. Given the overall skew, we propose an alignment metric, revealing that the United States is the best-aligned country and GLM-4 has the best ability to align to cultural values. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output. Specifically, models, regardless of where they originate, align better with the US than they do with China. The conclusions provide insight into how LLMs can be better aligned to various cultures as well as provoke further discussion of the potential for LLMs to propagate cultural bias and the need for more culturally adaptable models.
中文: 本研究对大型语言模型的文化对齐性进行了大规模评估,发现尽管模型输出总体上呈现温和的文化中立立场,但美国是文化对齐最佳的国家,GLM-4模型在文化价值对齐方面表现最优,且所有模型与美国的文化对齐度均显著高于中国。
English: This study conducts a large-scale evaluation of cultural alignment in large language models (LLMs), revealing that while outputs generally reflect a moderate cultural middle ground, the United States is the best-aligned country and GLM-4 excels in cultural value alignment, with models consistently aligning more closely with the U.S. than China.

Authors:Nicholas Sukiennik, Haoyu Wang, Zailin Zeng, Chen Gao, Yong Li
Title: Simulating Filter Bubble on Short-video Recommender System with Large Language Model Agents
Abstract:
An increasing reliance on recommender systems has led to concerns about the creation of filter bubbles on social media, especially on short video platforms like TikTok. However, their formation is still not entirely understood due to the complex dynamics between recommendation algorithms and user feedback. In this paper, we aim to shed light on these dynamics using a large language model-based simulation framework. Our work employs real-world short-video data containing rich video content information and detailed user-agents to realistically simulate the recommendation-feedback cycle. Through large-scale simulations, we demonstrate that LLMs can replicate real-world user-recommender interactions, uncovering key mechanisms driving filter bubble formation. We identify critical factors, such as demographic features and category attraction that exacerbate content homogenization. To mitigate this, we design and test interventions including various cold-start and feedback weighting strategies, showing measurable reductions in filter bubble effects. Our framework enables rapid prototyping of recommendation strategies, offering actionable solutions to enhance content diversity in real-world systems. Furthermore, we analyze how LLM-inherent biases may propagate through recommendations, proposing safeguards to promote equity for vulnerable groups, such as women and low-income populations. By examining the interplay between recommendation and LLM agents, this work advances a deeper understanding of algorithmic bias and provides practical tools to promote inclusive digital spaces.
中文: 本研究采用基于大语言模型的仿真框架,揭示短视频平台推荐系统与用户互动如何形成信息茧层,识别出加剧内容同质化的关键因素,并通过测试冷启动和反馈加权等干预策略有效削弱茧层效应,同时提出防范算法偏见以促进数字空间公平性。
English: This study uses a large language model-based simulation to analyze how recommender systems and user interactions create filter bubbles on short video platforms, identifying key contributing factors and testing interventions that effectively reduce content homogenization while addressing inherent biases to promote equity.

Authors:Mingqing Zhang, Qiang Liu, Xiang Tao, Shu Wu, Liang Wang
Title: SINCon: Mitigate LLM-Generated Malicious Message Injection Attack for Rumor Detection
Abstract:
In the era of rapidly evolving large language models (LLMs), state-of-the-art rumor detection systems, particularly those based on Message Propagation Trees (MPTs), which represent a conversation tree with the post as its root and the replies as its descendants, are facing increasing threats from adversarial attacks that leverage LLMs to generate and inject malicious messages. Existing methods are based on the assumption that different nodes exhibit varying degrees of influence on predictions. They define nodes with high predictive influence as important nodes and target them for attacks. If the model treats nodes' predictive influence more uniformly, attackers will find it harder to target high predictive influence nodes. In this paper, we propose Similarizing the predictive Influence of Nodes with Contrastive Learning (SINCon), a defense mechanism that encourages the model to learn graph representations where nodes with varying importance have a more uniform influence on predictions. Extensive experiments on the Twitter and Weibo datasets demonstrate that SINCon not only preserves high classification accuracy on clean data but also significantly enhances resistance against LLM-driven message injection attacks.
中文摘要:针对利用大语言模型攻击谣言检测系统中关键节点的对抗性威胁,本文提出SINCon防御机制,通过对比学习均衡节点预测影响力,在保持检测精度的同时显著提升抗攻击能力。
English Summary: In response to adversarial attacks using large language models to target key nodes in rumor detection systems, this paper introduces SINCon, a defense mechanism that equalizes node influence through contrastive learning to maintain accuracy and improve robustness against such attacks.

Authors:Liuji Chen, Hao Gao, Jinghao Zhang, Qiang Liu, Shu Wu, Liang Wang
Title: Select Me! When You Need a Tool: A Black-box Text Attack on Tool Selection
Abstract:
Tool learning serves as a powerful auxiliary mechanism that extends the capabilities of large language models (LLMs), enabling them to tackle complex tasks requiring real-time relevance or high-precision operations. Behind its powerful capabilities lie some potential security issues. However, previous work has primarily focused on how to make the output of the invoked tools incorrect or malicious, with little attention given to the manipulation of tool selection. To fill this gap, in this paper we introduce, for the first time, a black-box text-based attack that can significantly increase the probability of the target tool being selected. We propose a two-level text perturbation attack with a coarse-to-fine granularity, attacking the text at both the word level and the character level. We conduct comprehensive experiments that demonstrate the attacker only needs to make some perturbations to the tool's textual information to significantly increase the possibility of the target tool being selected and ranked higher among the candidate tools. Our research reveals the vulnerability of the tool selection process and paves the way for future research on protecting this process.
中文: 本文首次提出一种基于文本的黑盒攻击方法,通过从粗到细的粒度扰动操纵大语言模型的工具选择过程,揭示了该环节的脆弱性并为后续防护研究指明了方向。
English: This paper introduces a black-box text-based attack that manipulates tool selection in large language models through coarse-to-fine granularity perturbations, revealing vulnerabilities in the selection process and calling for future protective measures.

Authors:Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li
Title: The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?
Abstract:
3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial reasoning remains under-explored. In this work, we comprehensively evaluate and analyze these models to answer the research question: Does point cloud truly boost the spatial reasoning capacities of 3D LLMs? We first evaluate the spatial reasoning capacity of LLMs with different input modalities by replacing the point cloud with the visual and text counterparts. We then propose a novel 3D QA (Question-answering) benchmark, ScanReQA, that comprehensively evaluates models' understanding of binary spatial relationships. Our findings reveal several critical insights: 1) LLMs without point input can achieve competitive performance even in a zero-shot manner; 2) existing 3D LLMs struggle to comprehend the binary spatial relationships; 3) 3D LLMs exhibit limitations in exploiting the structural coordinates in point clouds for fine-grained spatial reasoning. We think these conclusions can help the next step of 3D LLMs and also offer insights for foundation models in other modalities. We release datasets and reproducible code on the anonymous project page: https://3d-llm.xyz.
Chinese: 本研究评估了利用点云进行空间推理的3D大语言模型,发现无点输入模型仍具竞争力,且现有3D大模型在二元空间关系和细粒度推理方面存在明显局限。
English: This study evaluates 3D LLMs using point clouds for spatial reasoning, revealing that models without point inputs can perform competitively and existing 3D LLMs struggle with binary spatial relationships and fine-grained reasoning.

Authors:Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Abstract:
Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. Recent approaches have investigated training-free strategies to enable high-resolution image synthesis with pre-trained models. However, these techniques often struggle with generating high-quality visuals and tend to exhibit artifacts or low-fidelity details, as they typically rely solely on the endpoint of the low-resolution sampling trajectory while neglecting intermediate states that are critical for preserving structure and synthesizing finer detail. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging such flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate HiFlow's capability in achieving superior high-resolution image quality over state-of-the-art methods.
中文: HiFlow是一种无需训练的框架,通过初始化、方向和加速三个维度的流对齐指导,有效提升文本到图像模型的高分辨率生成质量,在保持结构的同时增强细节保真度。
English: HiFlow is a training-free framework that enhances high-resolution image synthesis in text-to-image models by aligning low-resolution flow information across initialization, direction, and acceleration to preserve structure and improve detail fidelity.

Authors:Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation
Abstract:
Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82% in parameters and 85% in FLOPs. Our code will be available at our project page (https://amazingren.github.io/AnyIR/).
中文:AnyIR提出了一种统一的图像修复模型,通过联合嵌入和空频并行融合策略利用退化相似性,在不依赖大型语言模型的情况下实现了最先进的性能,同时将参数量减少约82%、计算量降低85%。
English: AnyIR introduces a unified image restoration model that leverages degradation similarities through joint embeddings and spatial-frequency fusion, achieving state-of-the-art performance while reducing parameters by 82% and FLOPs by 85% without relying on large language models.

Authors:Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, Qinglin Liu, Wei Yu, Xiaoqian Lv, Lu Yang, Shuigen Wang, Shengping Zhang, Xiangyang Ji, Long Bao, Yuqiang Yang, Jinao Song, Ziyi Wang, Shuang Wen, Heng Sun, Kean Liu, Mingchen Zhong, Senyan Xu, Zhijing Sun, Jiaying Zhu, Chengjie Ge, Xingbo Wang, Yidi Liu, Xin Lu, Xueyang Fu, Zheng-Jun Zha, Dawei Fan, Dafeng Zhang, Yong Yang, Siru Zhang, Qinghua Yang, Hao Kang, Huiyuan Fu, Heng Zhang, Hongyuan Yu, Zhijuan Huang, Shuoyan Wei, Feng Li, Runmin Cong, Weiqi Luo, Mingyun Lin, Chenxu Jiang, Hongyi Liu, Lei Yu, Weilun Li, Jiajun Zhai, Tingting Lin, Shuang Ma, Sai Zhou, Zhanwen Liu, Yang Wang, Eiffel Chong, Nuwan Bandara, Thivya Kandappu, Archan Misra, Yihang Chen, Zhan Li, Weijun Yuan, Wenzhuo Wang, Boyang Yao, Zhanglu Chen, Yijing Sun, Tianjiao Wan, Zijian Gao, Qisheng Xu, Kele Xu, Yukun Zhang, Yu He, Xiaoyan Xie, Tao Fu, Yashu Gautamkumar Patel, Vihar Ramesh Jain, Divesh Basina, Rishik Ashili, Manish Kumar Manjhi, Sourav Kumar, Prinon Benny, Himanshu Ghunawat, B Sri Sairam Gautam, Anett Varghese, Abhishek Yadav
Title: NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results
Abstract:
This paper presents an overview of the NTIRE 2025 First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
中文摘要:本文介绍了NTIRE 2025基于事件图像去模糊的首届挑战赛,15支参赛团队利用事件和图像数据开发去模糊方法,在无计算限制条件下取得了显著成果。
English Summary: This paper introduces the NTIRE 2025 challenge on event-based image deblurring, where 15 teams developed methods using event and image inputs to achieve high-quality results without computational constraints.
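
For reference, the PSNR metric used to rank submissions follows the standard definition PSNR = 10 * log10(MAX^2 / MSE). The sketch below is a minimal pure-Python version, assuming 8-bit images stored as nested lists of pixel rows; the function name and data layout are illustrative, and the challenge's own evaluation script may differ in details such as color handling.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         restored: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    ref_flat = [p for row in reference for p in row]
    out_flat = [p for row in restored for p in row]
    if not ref_flat or len(ref_flat) != len(out_flat):
        raise ValueError("images must be non-empty and have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, out_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    sharp = [[10, 20], [30, 40]]
    deblurred = [[11, 21], [31, 41]]  # every pixel is off by one
    print(round(psnr(sharp, deblurred), 2))
```

Higher values indicate a restored image closer to the sharp reference; identical images give infinite PSNR.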

Authors:Lei Sun, Yuhan Bao, Jiajun Zhai, Jingyun Liang, Yulun Zhang, Kaiwei Wang, Danda Pani Paudel, Luc Van Gool
Title: Low-Light Image Enhancement using Event-Based Illumination Estimation
Abstract:
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., "motion events", to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using "temporal-mapping" events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated to synthesize realistic training data. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect the EvLowLight dataset, which includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frames per second on a 640×480 image.
Chinese: 本文提出RetinEV方法,通过利用时间映射事件估计光照并增强图像反射率,在低光图像增强中实现了更高的动态范围和效率,显著优于现有技术。
English: This paper introduces RetinEV, a novel low-light image enhancement method that leverages temporal-mapping events to estimate illumination and enhance image reflectance, achieving superior dynamic range and efficiency over existing approaches.
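
The two ideas in this abstract, mapping event timestamps to brightness and decomposing an image into reflectance and illumination, can be illustrated with a rough sketch. This is not the RetinEV implementation. It assumes that a brighter pixel fires its first event earlier, modeled here as brightness proportional to 1/t, and it uses the classic Retinex relation image = reflectance × illumination. All function names are hypothetical.

```python
from typing import List, Optional


def timestamps_to_illumination(first_event_times: List[Optional[float]],
                               eps: float = 1e-6) -> List[float]:
    """Map each pixel's first-event timestamp to a relative brightness in [0, 1].

    Assumption: earlier events mean brighter pixels (brightness ~ 1 / t).
    Pixels that never fired (None) are treated as darkest (0.0).
    """
    inverse = [0.0 if t is None else 1.0 / max(t, eps) for t in first_event_times]
    peak = max(inverse, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in inverse]


def reflectance(image: List[float], illumination: List[float],
                eps: float = 1e-3) -> List[float]:
    """Retinex-style decomposition: reflectance = image / illumination."""
    return [i / max(l, eps) for i, l in zip(image, illumination)]
```

In this toy model, a pixel that fires at t = 1 is twice as bright as one that fires at t = 2, and dividing the low-light image by the estimated illumination recovers a reflectance map that no longer depends on lighting.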

Authors:Jia Li, Xianjie Shi, Kechi Zhang, Lei Li, Ge Li, Zhengwei Tao, Jia Li, Fang Liu, Chongyang Tao, Zhi Jin
Title: CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation
Abstract:
Large language models (LLMs) have shown promising performance in automated code generation, especially excelling in simple tasks such as generating standalone code. Unlike these simple tasks, real-world code generation usually depends on a specific programming environment (e.g., a code repository). Such an environment contains complex dependencies and domain knowledge that LLMs need when generating target code snippets. In this paper, we propose CodeRAG, a retrieval-augmented code generation (RAG) framework to comprehensively retrieve supportive code for real-world code generation. Beginning with the requirement, CodeRAG first constructs a requirement graph for the current repository, and retrieves sub- and similar- requirement nodes of the target requirement on the graph. Meanwhile, it models the repository into a DS-code graph. CodeRAG then maps these relevant requirement nodes into their corresponding code nodes, and treats these code nodes as anchors for LLM reasoning on the DS-code graph. Finally, CodeRAG introduces a code-oriented agentic reasoning process, allowing LLMs to reason and comprehensively retrieve the supportive code that they need for generating correct programs. Experiments show that CodeRAG achieves significant improvements (i.e., increasing Pass@1 by 40.90 and 37.79 on GPT-4o and Gemini-Pro on DevEval) compared to no-RAG scenarios. Further tests on reasoning LLMs (i.e., QwQ-32B) confirm CodeRAG's adaptability and efficacy across various types of LLMs. In addition, CodeRAG outperforms commercial programming products such as Copilot and Cursor. We further investigate the performance of our framework on different dependency types, and observe that CodeRAG is superior in generating examples where target code invokes predefined cross-file code snippets. These results demonstrate CodeRAG's potential in solving real-world repo-level coding challenges.
Chinese Summary: CodeRAG是一个检索增强的代码生成框架,通过从代码仓库图中全面检索相关依赖和领域知识来增强大语言模型生成实际代码的能力,相比现有方法实现了显著性能提升。
English Summary: CodeRAG is a retrieval-augmented framework that enhances large language models' ability to generate real-world code by comprehensively retrieving relevant code dependencies and domain knowledge from repository graphs, achieving significant performance improvements over existing methods.

Authors:Yi Zhang, Yiwen Zhang, Yu Wang, Tong Chen, Hongzhi Yin
Title: Towards Distribution Matching between Collaborative and Language Spaces for Generative Recommendation
Abstract:
Generative recommendation aims to learn the underlying generative process over the entire item set to produce recommendations for users. Although it leverages non-linear probabilistic models to surpass the limited modeling capacity of linear factor models, it is often constrained by a trade-off between representation ability and tractability. With the rise of a new generation of generative methods based on pre-trained language models (LMs), incorporating LMs into general recommendation with implicit feedback has gained considerable attention. However, adapting them to generative recommendation remains challenging. The core reason lies in the mismatch between the input-output formats and semantics of generative models and LMs, making it challenging to achieve optimal alignment in the feature space. This work addresses this issue by proposing a model-agnostic generative recommendation framework called DMRec, which introduces a probabilistic meta-network to bridge the outputs of LMs with user interactions, thereby enabling an equivalent probabilistic modeling process. Subsequently, we design three cross-space distribution matching processes aimed at maximizing shared information while preserving the unique semantics of each space and filtering out irrelevant information. We apply DMRec to three different types of generative recommendation methods and conduct extensive experiments on three public datasets. The experimental results demonstrate that DMRec can effectively enhance the recommendation performance of these generative models, and it shows significant advantages over mainstream LM-enhanced recommendation methods.
中文摘要:生成式推荐因生成模型与语言模型的输入输出不匹配而面临挑战,DMRec通过概率元网络和跨空间分布匹配有效解决了这一问题,显著提升了推荐性能。
English Summary: Generative recommendation faces challenges in aligning generative models with language models due to input-output mismatches, which DMRec addresses through a probabilistic meta-network and cross-space distribution matching to enhance performance.

Authors:Yuchuan Zhao, Tong Chen, Junliang Yu, Kai Zheng, Lizhen Cui, Hongzhi Yin
Title: Diversity-aware Dual-promotion Poisoning Attack on Sequential Recommendation
Abstract:
Sequential recommender systems (SRSs) excel in capturing users' dynamic interests, thus playing a key role in various industrial applications. The popularity of SRSs has also driven emerging research on their security aspects, where data poisoning attack for targeted item promotion is a typical example. Existing attack mechanisms primarily focus on increasing the ranks of target items in the recommendation list by injecting carefully crafted interactions (i.e., poisoning sequences), which comes at the cost of demoting users' real preferences. Consequently, noticeable recommendation accuracy drops are observed, restricting the stealthiness of the attack. Additionally, the generated poisoning sequences are prone to substantial repetition of target items, which is a result of the unitary objective of boosting their overall exposure and lack of effective diversity regularizations. Such homogeneity not only compromises the authenticity of these sequences, but also limits the attack effectiveness, as it ignores the opportunity to establish sequential dependencies between the target and many more items in the SRS. To address the issues outlined, we propose a Diversity-aware Dual-promotion Sequential Poisoning attack method named DDSP for SRSs. Specifically, by theoretically revealing the conflict between recommendation and existing attack objectives, we design a revamped attack objective that promotes the target item while maintaining the relevance of preferred items in a user's ranking list. We further develop a diversity-aware, auto-regressive poisoning sequence generator, where a re-ranking method is in place to sequentially pick the optimal items by integrating diversity constraints.
中文摘要:针对序列推荐系统中数据投毒攻击以牺牲用户偏好和隐蔽性为代价提升目标物品的问题,本文提出DDSP方法,通过设计双重提升目标和多样性感知的序列生成机制实现有效应对。
English Summary: Sequential recommender systems face security threats from data poisoning attacks that promote target items at the expense of user preferences and attack stealthiness, which the proposed DDSP method addresses by introducing a dual-promotion objective and diversity-aware sequence generation.
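
The diversity-aware re-ranking idea can be sketched as a greedy loop that discounts items already placed in the sequence. This is only an illustration of the general technique, not the DDSP generator: the per-step score dictionaries, the linear `repeat_penalty`, and the function name are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def generate_sequence(step_scores: List[Dict[str, float]],
                      repeat_penalty: float = 1.0) -> List[str]:
    """Greedily build a sequence, penalizing items that were already chosen.

    step_scores[t] maps each candidate item to its (hypothetical) model score
    at step t; each prior occurrence of an item lowers its score by
    `repeat_penalty`.
    """
    sequence: List[str] = []
    counts: Counter = Counter()
    for scores in step_scores:
        best = max(scores, key=lambda item: scores[item] - repeat_penalty * counts[item])
        sequence.append(best)
        counts[best] += 1
    return sequence
```

With `repeat_penalty=0.0` the loop degenerates to repeating the top-scored item at every step, which is the homogeneity problem the abstract describes; a positive penalty interleaves other items with the target.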

Authors:Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
Title: Inference-Time Scaling for Generalist Reward Modeling
Abstract:
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $\textit{proper learning methods could enable effective inference-time scalability}$. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the $\textbf{inference-time scalability of generalist RM}$, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in $\textbf{DeepSeek-GRM}$ models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models are released at Hugging Face and ModelScope.
中文摘要:本研究提出自原则批判调优方法,通过自适应原则生成和精确批判来增强通用奖励建模,研发的DeepSeek-GRM模型在多项基准测试中展现出优于现有方法的可扩展性和性能表现。
English Summary: This study introduces Self-Principled Critique Tuning (SPCT) to enhance generalist reward modeling through adaptive principle generation and accurate critiques, resulting in DeepSeek-GRM models that show improved scalability and performance across various benchmarks compared to existing methods.
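
The inference-time scaling step described above (parallel sampling followed by meta-RM-guided voting) can be sketched as follows. This is a simplified illustration under stated assumptions: each sampled critique yields one integer score per candidate response, a separate meta RM assigns one quality score to each sample, and only the top `keep` samples contribute to the vote. The function and parameter names are hypothetical, not taken from the DeepSeek-GRM release.

```python
from typing import List, Sequence


def meta_rm_vote(sampled_scores: Sequence[Sequence[int]],
                 meta_scores: Sequence[float],
                 keep: int) -> List[int]:
    """Sum pointwise scores over the `keep` samples the meta RM rates highest.

    sampled_scores[s][r] is the score that sample s gave to response r.
    meta_scores[s] is the (assumed) meta-RM quality score of sample s.
    """
    ranked = sorted(range(len(sampled_scores)),
                    key=lambda s: meta_scores[s], reverse=True)
    kept = ranked[:keep]
    n_responses = len(sampled_scores[0])
    return [sum(sampled_scores[s][r] for s in kept) for r in range(n_responses)]
```

Setting `keep` to the number of samples reduces this to plain voting over all samples; a smaller `keep` lets the meta RM filter out low-quality critiques before they are summed.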

Authors:Xiuwei Shang, Zhenkan Fu, Shaoyin Cheng, Guoqiang Chen, Gangyang Li, Li Hu, Weiming Zhang, Nenghai Yu
Title: An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding
Abstract:
Binary code analysis plays a pivotal role in the field of software security and is widely used in tasks such as software maintenance, malware detection, software vulnerability discovery, and patch analysis. However, unlike source code, binary code lacks intuitive semantic information, so reverse engineers face significant challenges in understanding it. Although traditional reverse engineering tools can convert binary code into C-like pseudo code, the lack of code comments and symbolic information such as function names still makes code understanding difficult. In recent years, two groups of techniques have shown promising prospects: (1) deep learning-based techniques have demonstrated competitive results in tasks related to binary code understanding; and (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This has left participants wondering about the capabilities of LLMs in binary code understanding. To this end, this work proposes a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios, which covers two key binary code understanding tasks, i.e., function name recovery and binary code summarization. For a more comprehensive evaluation, we include binaries with multiple target architectures as well as different optimization options. We gain valuable insights into the capabilities and limitations of popular LLMs through extensive empirical studies using our benchmark. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of LLMs in advancing the field of binary code understanding, and provide new directions for binary code analysis techniques.
中文: 二进制代码分析在软件安全中至关重要但缺乏语义信息而极具挑战,本研究提出一个基准来评估大语言模型在实际逆向工程中的能力,发现其虽存在局限却展现出提升二进制代码理解的巨大潜力。
English: Binary code analysis is crucial for software security but challenging due to missing semantic details, and this study introduces a benchmark to assess large language models' effectiveness in real-world reverse engineering tasks, revealing their potential to enhance binary code understanding despite current limitations.

Authors:Xin Zhang, Kejiang Chen, Na Zhao, Weiming Zhang, Nenghai Yu
Title: Provably Secure Public-Key Steganography Based on Admissible Encoding
Abstract:
The technique of hiding secret messages within seemingly harmless covertext to evade examination by censors, with rigorous security proofs, is known as provably secure steganography (PSS). PSS has evolved from symmetric-key steganography to public-key steganography, functioning without the requirement of a pre-shared key and enabling the extension to multi-party covert communication and identity verification mechanisms. Recently, a public-key steganography method based on elliptic curves was proposed, which uses point compression to eliminate the algebraic structure of curve points. However, this method has strict requirements on the curve parameters and is only available on half of the points. To overcome these limitations, this paper proposes a more general elliptic curve public-key steganography method based on admissible encoding. By applying the tensor square function to the known well-distributed encoding, we construct an admissible encoding, which can create the pseudo-random public-key encryption function. The theoretical analysis and experimental results show that the proposed provably secure public-key steganography method can be deployed on all types of curves and utilize all points on the curve.
中文摘要:本文提出了一种基于可容许编码的通用椭圆曲线公钥隐写方法,通过构建伪随机公钥加密函数,解决了以往方法对曲线参数和点使用的限制,实现了在所有类型曲线上利用全部点的可证明安全隐写。
English Summary: This paper introduces a more versatile elliptic curve-based public key steganography method using admissible encoding, which overcomes previous limitations by working with all curve types and utilizing all curve points for provably secure covert communication.

Authors:Xiangkun Wang, Kejiang Chen, Yuang Qi, Ruiheng Liu, Weiming Zhang, Nenghai Yu
Title: GIFDL: Generated Image Fluctuation Distortion Learning for Enhancing Steganographic Security
Abstract:
Minimum distortion steganography is currently the mainstream method for modification-based steganography. A key issue in this method is how to define steganographic distortion. With the rapid development of deep learning technology, the definition of distortion has evolved from manual design to deep learning design. Concurrently, rapid advancements in image generation have made generated images viable as cover media. However, existing distortion design methods based on machine learning do not fully leverage the advantages of generated cover media, resulting in suboptimal security performance. To address this issue, we propose GIFDL (Generated Image Fluctuation Distortion Learning), a steganographic distortion learning method based on the fluctuations in generated images. Inspired by the idea of natural steganography, we take a series of highly similar fluctuation images as the input to the steganographic distortion generator and introduce a new GAN training strategy to disguise stego images as fluctuation images. Experimental results demonstrate that GIFDL, compared with state-of-the-art GAN-based distortion learning methods, exhibits superior resistance to steganalysis, increasing the detection error rates by an average of 3.30% across three steganalyzers.
Chinese: 提出的GIFDL方法利用生成图像的波动特性来设计隐写失真,通过新型GAN训练策略将隐写图像伪装为波动图像,相比现有方法使三种隐写分析器的平均检测错误率提高了3.30%,显著提升了安全性。
English: The proposed GIFDL method leverages fluctuations in generated images to design steganographic distortion, employing a novel GAN training strategy that enhances security by increasing steganalysis detection error rates by an average of 3.30%.

Authors:Zijin Yang, Xin Zhang, Kejiang Chen, Kai Zeng, Qiyi Yao, Han Fang, Weiming Zhang, Nenghai Yu
Title: Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models
Abstract:
Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address this issue, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.
中文摘要:高斯着色++方法通过双通道伪随机纠错编码和软判决解码设计,解决了实际部署中的水印密钥管理、用户参数变化和第三方验证难题,在保持性能无损的同时显著提升了水印鲁棒性和防伪能力。
English Summary: The proposed Gaussian Shading++ method addresses real-world diffusion model watermarking challenges by implementing double-channel encoding with pseudorandom error-correcting codes and soft decision decoding, ensuring performance-lossless watermarking, robust verification under varying parameters, and third-party verification through public key signatures.
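
To illustrate why soft decision decoding helps over an additive white Gaussian noise (AWGN) channel, the sketch below compares hard and soft decoding. It deliberately substitutes a simple repetition code for the pseudorandom error-correcting codes used in the paper, so it shows only the decoding principle. The bit mapping (1 → +1, 0 → −1) and the function names are assumptions made for the example.

```python
from typing import List


def hard_decode(received: List[float]) -> int:
    """Threshold each noisy symbol first, then take a majority vote."""
    votes = sum(1 if r > 0 else -1 for r in received)
    return 1 if votes > 0 else 0


def soft_decode(received: List[float]) -> int:
    """Sum the raw noisy symbols so that confident symbols carry more weight."""
    return 1 if sum(received) > 0 else 0
```

For example, given the received symbols `[2.1, -0.2, -0.3]`, hard decoding outputs 0 because two weak negative symbols outvote one strong positive symbol, whereas soft decoding outputs 1 because the sum is positive.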

Authors:ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, Riwei Chen, Liangqiang Chen, Zixin Chen, Jinsong Chen, Siyan Chen, Kaiyuan Chen, Zhi Chen, Jin Chen, Jiecao Chen, Jinxin Chi, Weinan Dai, Ning Dai, Jiahui Dai, Shihan Dou, Yantao Du, Zhengyin Du, Jianhui Duan, Chen Dun, Ting-Han Fan, Jiazhan Feng, Junda Feng, Ziyuan Feng, Yuwei Fu, Wenqi Fu, Hanjie Fu, Hao Ge, Hongyi Guo, Mingji Han, Li Han, Wenhao Hao, Xintong Hao, Qianyu He, Jerry He, Feng He, Wen Heng, Zehua Hong, Qi Hou, Liang Hu, Shengding Hu, Nan Hu, Kai Hua, Qi Huang, Ziyue Huang, Hongzhi Huang, Zihao Huang, Ting Huang, Wenhao Huang, Wei Jia, Bin Jia, Xiaoying Jia, Yuhua Jiang, Haobin Jiang, Ziheng Jiang, Kaihua Jiang, Chengquan Jiang, Jianpeng Jiao, Xiaoran Jin, Xing Jin, Xunhao Lai, Zheng Li, Xiang Li, Liyi Li, Hongkai Li, Zheng Li, Shengxian Wan, Ya Wang, Yunshui Li, Chenggang Li, Niuniu Li, Siyu Li, Xi Li, Xiao Li, Aoyan Li, Yuntao Li, Nianning Liang, Xinnian Liang, Haibin Lin, Weijian Lin, Ye Lin, Zhicheng Liu, Guanlin Liu, Guanlin Liu, Chenxiao Liu, Yan Liu, Gaohong Liu, Juncai Liu, Chundian Liu, Deyi Liu, Kaibo Liu, Siyao Liu, Qi Liu, Yongfei Liu, Kang Liu, Gan Liu, Boyi Liu, Rui Long, Weiqiang Lou, Chenwei Lou, Xiang Luo, Yao Luo, Caiping Lv, Heyang Lv, Bole Ma, Qianli Ma, Hongzhi Ma, Yiyuan Ma, Jin Ma, Wenchang Ma, Tingting Ma, Chen Mao, Qiyang Min, Zhe Nan, Guanghan Ning, Jinxiang Ou, Haojie Pan, Renming Pang, Yanghua Peng, Tao Peng, Lihua Qian, Lihua Qian, Mu Qiao, Meng Qu, Cheng Ren, Hongbin Ren, Yong Shan, Wei Shen, Ke Shen, Kai Shen, Guangming Sheng, Jinlong Shi, Wenlei Shi, Guang Shi, Shuai Shuai Cao, Yuxin Song, Zuquan Song, Jing Su, Yifan Sun, Tao Sun, Zewei Sun, Borui Wan, Zihan Wang, Xiaohui Wang, Xi Wang, Shuguang Wang, Jun Wang, Qinlong Wang, Chenyuan Wang, Shuai Wang, Zihan Wang, Changbao Wang, Jiaqiang Wang, Shihang Wang, Xuwu Wang, Zaiyuan Wang, Yuxuan Wang, Wenqi Wang, Taiqing Wang, Chengzhi Wei, Houmin Wei, Ziyun Wei, Shufa Wei, Zheng Wu, Yonghui Wu, Yangjun Wu, Bohong Wu, Shuang Wu, Jingqiao Wu, Ning Wu, Shuangzhi Wu, Jianmin Wu, Chenguang Xi, Fan Xia, Yuqiao Xian, Liang Xiang, Boren Xiang, Bowen Xiao, Zhen Xiao, Xia Xiao, Yongsheng Xiao, Chao Xin, Shulin Xin, Yuwen Xiong, Jingjing Xu, Ziwen Xu, Chenyin Xu, Jiayi Xu, Yifan Xu, Wei Xu, Yufei Xu, Shikun Xu, Shipeng Yan, Shen Yan, Qingping Yang, Xi Yang, Tianhao Yang, Yuehang Yang, Yuan Yang, Ximing Yang, Zeyu Yang, Guang Yang, Yifan Yang, Xuesong Yao, Bairen Yi, Fan Yin, Jianian Yin, Ziqiang Ying, Xiangyu Yu, Hongli Yu, Song Yu, Menghan Yu, Huan Yu, Siyu Yuan, Jun Yuan, Yutao Zeng, Tianyang Zhan, Zheng Zhang, Yun Zhang, Mofan Zhang, Wang Zhang, Ru Zhang, Zhi Zhang, Tianqi Zhang, Xinyi Zhang, Zhexi Zhang, Sijun Zhang, Wenqiang Zhang, Xiangxiang Zhang, Yongtao Zhang, Yuyu Zhang, Ge Zhang, He Zhang, Yue Zhang, Renjie Zheng, Ningxin Zheng, Zhuolin Zheng, Yaowei Zheng, Chen Zheng, Xiaoyun Zhi, Wanjun Zhong, Cheng Zhong, Zheng Zhong, Baoquan Zhong, Xun Zhou, Na Zhou, Huan Zhou, Hang Zhu, Defa Zhu, Wenjia Zhu, Lei Zuo
Title: Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
中文: Seed1.5-Thinking 是一款紧凑型专家混合模型,通过先思考后应答的机制显著提升推理能力,在STEM领域、编程及多类基准测试中表现卓越,并展现出优于现有模型的泛化性能。
English: Seed1.5-Thinking is a compact Mixture-of-Experts model that enhances reasoning capabilities through pre-response deliberation, achieving top-tier results across STEM, coding, and diverse benchmarks while demonstrating superior generalization over existing models.

Authors:Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh
Title: NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Abstract:
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective data augmentation method that mixes trajectories from both clean and moderately distorted images during RL training. By injecting targeted diversity in visual perception and the resulting reasoning patterns, NoisyRollout promotes better policy exploration through vision-oriented inductive biases, ultimately leading to more robust reasoning behaviors. We further adopt a noise annealing schedule that gradually reduces distortion strength over training, leveraging noisy signals early on while ensuring training stability in later stages. Crucially, our method is easy to adopt, requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B) and data scales (from 1K to 6K), highlighting its generalizability and scalability.
中文: 提出的NoisyRollout方法通过在强化学习训练中融合清晰与失真图像轨迹,增强了视觉语言模型的推理鲁棒性,无需额外训练成本即可在多个基准测试中实现最优性能。
English: The proposed NoisyRollout method enhances vision-language models' reasoning robustness by incorporating both clean and distorted image trajectories during reinforcement learning training, achieving state-of-the-art performance across multiple benchmarks without additional training costs.
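
A noise annealing schedule of the kind mentioned in the abstract can be sketched as a function from the training step to a distortion strength. The abstract does not specify the shape of the schedule, so the linear decay below is an assumption chosen for simplicity, and the function and parameter names are hypothetical.

```python
def annealed_noise_strength(step: int, total_steps: int,
                            initial_strength: float,
                            final_strength: float = 0.0) -> float:
    """Linearly interpolate distortion strength from initial to final.

    Assumption: linear decay; the schedule used in the paper may differ.
    Steps outside [0, total_steps] are clamped to the nearest end.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_strength + (final_strength - initial_strength) * progress
```

Early in training the distorted rollouts are strongly perturbed, which encourages exploration; as `step` approaches `total_steps` the strength falls toward `final_strength`, so late-stage updates see inputs close to the clean images.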

Authors:Shizhan Cai, Liang Ding, Dacheng Tao
Title: Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
Abstract:
The rapid development of Large Language Models (LLMs) has intensified concerns about content traceability and potential misuse. Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. To address these issues, we propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold. Our approach is compatible with and generalizes existing sampling functions, enhancing adaptability. Experimental results across multiple LLMs show that our scheme significantly outperforms existing methods, achieving over 80% improvements on widely-used datasets, e.g., MATH and GSM8K, while maintaining high detection accuracy.
Chinese: 本文提出了一种新颖的大语言模型水印方案,通过引入累积水印熵阈值,在提高检测能力的同时保持了文本质量,在MATH和GSM8K等数据集上显著优于现有方法。
English: This paper introduces a novel watermarking scheme for large language models that enhances both detectability and text quality by using a cumulative watermark entropy threshold, significantly outperforming existing methods on datasets like MATH and GSM8K.

Authors:Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, Xipeng Qiu
Title: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Abstract:
Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
Chinese: VisuoThink是一种创新框架,通过融合视觉空间与语言推理,利用多模态慢思考和测试时扩展来增强复杂问题解决能力,在几何和空间推理任务中实现了最先进的性能。
English: VisuoThink is a novel framework that integrates visuospatial and linguistic reasoning to enhance complex problem-solving through multimodal slow thinking and test-time scaling, achieving state-of-the-art results in geometry and spatial tasks.

Authors:Wei-Jer Chang, Wei Zhan, Masayoshi Tomizuka, Manmohan Chandraker, Francesco Pittaluga
Title: LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
Abstract:
Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.
中文: LangTraj是一种语言条件场景扩散模型,通过自然语言输入实现对交通场景的灵活控制,在自动驾驶车辆测试中展现出优越的真实性和安全关键场景模拟能力。
English: LangTraj is a language-conditioned scene-diffusion model that enables flexible control over traffic scenarios through natural language inputs, enhancing autonomous vehicle testing with improved realism and safety-critical simulation capabilities.

Authors:Chengyuan Liu, Shihang Wang, Lizhi Qing, Kaisong Song, Junjie Cao, Jun Lin, Ji Zhang, Ang Li, Kun Kuang, Fei Wu
Title: Towards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement
Abstract:
Recently, stepwise supervision on Chains of Thought (CoTs), aided by Monte Carlo Tree Search (MCTS), has improved performance on logical reasoning tasks such as coding and math. However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by this gap, we identify several potential challenges of vanilla MCTS in this context and propose the framework of Stepwise Domain Knowledge-Driven Reasoning Optimization, which employs the MCTS algorithm to develop step-level supervision for problems that require essential comprehension, reasoning, and specialized knowledge. We also introduce Preference Optimization towards Reflection Paths, which iteratively learns self-reflection on the reasoning thoughts from better perspectives. We have conducted extensive experiments to evaluate the advantages of these methodologies. Empirical results demonstrate their effectiveness on various legal-domain problems. We also report a diverse set of valuable findings, hoping to encourage further research on domain-specific LLMs and MCTS.
中文: 近期研究通过蒙特卡洛树搜索对思维链进行逐步监督,提升了编程和数学等逻辑推理任务的表现,现将其扩展至领域专业知识,提出一种利用专业知识和迭代自我反思优化推理的框架,并在法律领域问题中通过实验验证了其有效性。
English: Recent research enhances logical reasoning in tasks like coding and math through stepwise supervision on Chain of Thoughts using Monte Carlo Tree Search, and now extends this approach to domain-specific expertise by proposing a framework that optimizes reasoning with specialized knowledge and introduces iterative self-reflection for improved outcomes, validated through experiments in legal-domain problems.

Authors:Jie Xu, Yongxin Ma, Yixuan Li, Xuanxuan Zhang, Jun Zhou, Shenghai Yuan, Lihua Xie
Title: Dynamic Initialization for LiDAR-inertial SLAM
Abstract:
The accuracy of the initial state, including initial velocity, gravity direction, and IMU biases, is critical for the initialization of LiDAR-inertial SLAM systems. Inaccurate initial values can reduce initialization speed or lead to failure. When the system faces urgent tasks, robust and fast initialization is required while the robot is moving, such as during the swift assessment of rescue environments after natural disasters, bomb disposal, and restarting LiDAR-inertial SLAM in rescue missions. However, existing initialization methods usually require the platform to remain stationary, which is ineffective when the robot is in motion. To address this issue, this paper introduces a robust and fast dynamic initialization method for LiDAR-inertial systems (D-LI-Init). This method iteratively aligns LiDAR-based odometry with IMU measurements to achieve system initialization. To enhance the reliability of the LiDAR odometry module, the LiDAR and gyroscope are tightly integrated within the ESIKF framework. The gyroscope compensates for rotational distortion in the point cloud. Translational distortion compensation occurs during the iterative update phase, resulting in the output of LiDAR-gyroscope odometry. The proposed method can initialize the system whether the robot is moving or stationary. Experiments on public datasets and real-world environments demonstrate that the D-LI-Init algorithm can effectively serve various platforms, including vehicles, handheld devices, and UAVs. D-LI-Init completes dynamic initialization regardless of specific motion patterns. To benefit the research community, we have open-sourced our code and test datasets on GitHub.
中文: 本文提出的D-LI-Init动态初始化方法通过迭代对齐激光雷达里程计与IMU测量,实现了机器人在运动状态下的激光雷达-惯性SLAM系统快速初始化,有效突破了现有方法需保持静止的限制。
English: This paper presents a dynamic initialization method called D-LI-Init that enables robust LiDAR-inertial SLAM initialization during robot motion by iteratively aligning LiDAR odometry with IMU measurements, overcoming the stationary requirement of existing methods.

Authors:Yuan Yuan, Yuheng Zhang, Jingtao Ding, Yong Li
Title: WorldMove, a global open data for human mobility
Abstract:
High-quality human mobility data is crucial for applications such as urban planning, transportation management, and public health, yet its collection is often hindered by privacy concerns and data scarcity, particularly in less-developed regions. To address this challenge, we introduce WorldMove, a large-scale synthetic mobility dataset covering over 1,600 cities across 179 countries and 6 continents. Our method leverages publicly available multi-source data, including gridded population distribution, point-of-interest (POI) maps, and commuting origin-destination (OD) flows, to generate realistic city-scale mobility trajectories using a diffusion-based generative model. The generation process involves defining city boundaries, collecting multi-source input features, and simulating individual-level movements that reflect plausible daily mobility behavior. Comprehensive validation demonstrates that the generated data closely aligns with real-world observations, both in terms of fine-grained individual mobility behavior and city-scale population flows. Alongside the pre-generated datasets, we release the trained model and a complete open-source pipeline, enabling researchers and practitioners to generate custom synthetic mobility data for any city worldwide. This work not only fills critical data gaps, but also lays a global foundation for scalable, privacy-preserving, and inclusive mobility research, empowering data-scarce regions and enabling universal access to human mobility insights.
中文: WorldMove是一个基于扩散生成模型从公开数据源创建的大规模合成移动数据集,通过为全球1600多个城市生成逼真的移动轨迹来解决隐私和数据稀缺问题,验证显示其与真实移动模式高度吻合,并开源了供定制数据生成的完整工具链。
English: WorldMove is a large-scale synthetic mobility dataset generated using a diffusion-based model from public data sources, addressing privacy and scarcity issues by providing realistic mobility trajectories for over 1,600 cities worldwide, with validation showing close alignment to real-world patterns and an open-source pipeline for custom data generation.

Authors:Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
Title: SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Abstract:
We introduce ThinkLite-VL, a family of visual reasoning models that achieve state-of-the-art (SoTA) performance using an order of magnitude fewer training samples, relying purely on reinforcement fine-tuning (RFT) self-improvement without any knowledge distillation. Our central insight is that sample difficulty critically influences RFT effectiveness: appropriately challenging examples can drive substantial reasoning improvements, even in low-data regimes. However, quantifying sample difficulty in a reliable and scalable manner remains non-trivial. To address this, we repurpose Monte Carlo Tree Search (MCTS) to measure sample difficulty via the number of reasoning iterations a vision-language model (VLM) requires to solve each instance. This MCTS-based selection procedure identifies samples that induce deeper reasoning while remaining solvable, allowing us to filter a high-quality subset from 70k open-source examples spanning math, natural image understanding, and chart comprehension. Using this approach, we select just 11k challenging samples for RFT on Qwen2.5-VL-7B-Instruct and 7.5k samples for Qwen2.5-VL-72B-Instruct. The resulting models, ThinkLite-VL-7B and ThinkLite-VL-72B, significantly outperform their respective base models across eight visual reasoning benchmarks. In particular, ThinkLite-VL-7B improves the average performance of Qwen2.5-VL-7B-Instruct by 7% and surpasses all existing 7B-level models, as well as much larger models such as GPT-4o, O1 and Qwen2.5-VL-72B, achieving a new SoTA score of 75.1 on MathVista. ThinkLite-VL-72B further advances the SoTA frontier, achieving an accuracy of 79.7 on MathVista and an average benchmark improvement of 4.42 over the open-source SoTA. These results demonstrate that MCTS-guided difficulty filtering provides a scalable and effective path toward data-efficient self-improvement in multimodal reasoning.
中文: ThinkLite-VL模型通过蒙特卡洛树搜索筛选具有适当挑战性的训练样本,仅需极少量数据即可实现视觉推理能力的突破性提升,在多类基准测试中创造了新的性能记录。
English: ThinkLite-VL models achieve state-of-the-art visual reasoning performance with dramatically reduced training data by using Monte Carlo Tree Search to select challenging samples that enhance reinforcement fine-tuning effectiveness.
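
The difficulty-based selection described in this abstract can be illustrated with a minimal sketch. It assumes each sample has already been run through a search procedure that reports how many iterations it used and whether it found a correct answer. The `SearchResult` record, the `select_challenging` helper, and the threshold value are hypothetical names and numbers chosen for illustration; the abstract does not state the actual selection rule or cutoff.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    sample_id: str
    iterations: int  # search iterations used before stopping (assumed to be given)
    solved: bool     # whether a correct answer was found


def select_challenging(results, min_iterations):
    """Keep samples that were solved but needed at least `min_iterations`."""
    return [r.sample_id for r in results if r.solved and r.iterations >= min_iterations]


# Hypothetical search outcomes for four training samples.
results = [
    SearchResult("easy", 1, True),
    SearchResult("medium", 6, True),
    SearchResult("hard", 20, True),
    SearchResult("unsolved", 50, False),
]

# The threshold of 5 is an illustrative assumption, not a value from the abstract.
selected = select_challenging(results, min_iterations=5)
print(selected)
```

Here the iteration count acts as a difficulty proxy: trivially solved samples are dropped, and, under the assumption made in this sketch, unsolved ones are dropped as well.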

Authors:Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Title: Enabling Deep Visibility into VxWorks-Based Embedded Controllers in Cyber-Physical Systems for Anomaly Detection
Abstract:
We propose the DIVER (Defensive Implant for Visibility into Embedded Run-times) framework for real-time deep visibility into embedded control devices in cyber-physical systems (CPSs). DIVER enables run-time detection of anomalies and targets devices running the VxWorks real-time operating system (RTOS), which precludes traditional methods of implementing dynamic monitors using OS (e.g., Linux, Windows) functions. DIVER has two components: a "measurer" implant embedded into the VxWorks kernel to collect run-time measurements and provide interactive/streaming interfaces over TCP/IP, and a remote "listener" that acquires and analyzes measurements and provides an interactive user interface. DIVER focuses on small embedded devices with stringent resource constraints (e.g., insufficient storage to locally store measurements). To show the efficacy and scalability of DIVER, we demonstrate it on two embedded devices with different processor architectures and VxWorks versions: a Motorola ACE Remote Terminal Unit used in CPSs including power systems, and a Raspberry Pi representative of Internet-of-Things (IoT) applications.
中文: DIVER框架通过在VxWorks内核中嵌入测量器植入物进行实时异常检测,并结合远程监听器进行分析,为资源受限的小型嵌入式设备提供了对信息物理系统中嵌入式控制设备的深度实时可见性。
English: The DIVER framework provides real-time deep visibility into embedded control devices in cyber-physical systems by embedding a measurer implant in the VxWorks kernel for runtime anomaly detection and a remote listener for analysis, effectively addressing resource constraints in small embedded devices.

Authors:Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
Title: Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy
Abstract:
Generalizability, the capacity of a robust model to perform effectively on unseen data, is crucial for audio deepfake detection due to the rapid evolution of text-to-speech (TTS) and voice conversion (VC) technologies. A promising approach to differentiate between bonafide and spoof samples lies in identifying intrinsic disparities to enhance model generalizability. From an information-theoretic perspective, we hypothesize that information content is one of the intrinsic differences: a bonafide sample represents a dense, information-rich sampling of the real world, whereas a spoof sample is typically derived from lower-dimensional, less informative representations. To implement this, we introduce the frame-level latent information entropy detector (f-InfoED), a framework that extracts distinctive information entropy from latent representations at the frame level to identify audio deepfakes. Furthermore, we present AdaLAM, which extends large pre-trained audio models with trainable adapters for enhanced feature extraction. To facilitate comprehensive evaluation, the audio deepfake forensics 2024 (ADFF 2024) dataset was built using the latest TTS and VC methods. Extensive experiments demonstrate that our proposed approach achieves state-of-the-art performance and exhibits remarkable generalization capabilities. Further analytical studies confirm the efficacy of AdaLAM in extracting discriminative audio features and of f-InfoED in leveraging latent entropy information for more generalized deepfake detection.
中文: 本研究提出帧级潜在信息熵检测器(f-InfoED)和AdaLAM适配器,通过利用真实与伪造音频在信息含量上的本质差异,在ADFF 2024数据集上实现了最先进的泛化检测性能。
English: The study introduces a frame-level latent information entropy detector (f-InfoED) and AdaLAM adapter to enhance audio deepfake detection by exploiting intrinsic information differences between bonafide and spoof samples, achieving state-of-the-art generalization on the ADFF 2024 dataset.

Authors:Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Title: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Abstract:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
中文: InternVL3采用统一的多模态预训练范式,同步发展多模态与语言能力,在广泛任务中实现顶尖性能,并与主流专有模型保持竞争力。
English: InternVL3 introduces a unified multimodal pre-training paradigm that jointly develops multimodal and linguistic capabilities, achieving state-of-the-art performance across diverse tasks while remaining competitive with leading proprietary models.

Authors:Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao
Title: Structure-Accurate Medical Image Translation via Dynamic Frequency Balance and Knowledge Guidance
Abstract:
Multimodal medical images play a crucial role in precise and comprehensive clinical diagnosis. Diffusion models are a powerful approach to synthesizing the required medical images. However, existing approaches still suffer from anatomical structure distortion due to overfitting of high-frequency information and weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency components, enhancing global low-frequency features and effective high-frequency details while suppressing high-frequency noise. To further overcome the challenges posed by the large differences between medical modalities, we construct a knowledge-guided mechanism that fuses prior clinical knowledge from a visual language model with visual features to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show that the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.
中文: 本文提出了一种新颖的医学图像合成方法,通过动态频率平衡和知识引导机制增强低频特征并抑制高频噪声,有效减少解剖结构失真,在多项评估中展现出优越性能。
English: This paper introduces a novel medical image synthesis method that uses dynamic frequency balance and knowledge guidance to reduce anatomical distortion by enhancing low-frequency features and suppressing high-frequency noise, achieving superior performance in evaluations.
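
The wavelet-based split into low-frequency and high-frequency components mentioned in this abstract can be illustrated with a one-level Haar transform on a 1-D signal. This is a generic textbook sketch in the averaging (unnormalized) convention, not the paper's feature-level decomposition; the function names `haar_1d` and `inverse_haar_1d` are illustrative.

```python
def haar_1d(signal):
    """One-level Haar transform: returns (low-frequency, high-frequency) bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]   # local averages: coarse structure
    high = [(a - b) / 2 for a, b in pairs]  # local differences: fine detail
    return low, high


def inverse_haar_1d(low, high):
    """Rebuild the original signal from its two bands."""
    signal = []
    for avg, diff in zip(low, high):
        signal.extend([avg + diff, avg - diff])
    return signal


low, high = haar_1d([4.0, 6.0, 10.0, 12.0])
print(low)   # [5.0, 11.0]
print(high)  # [-1.0, -1.0]
print(inverse_haar_1d(low, high))  # [4.0, 6.0, 10.0, 12.0]
```

Because the two bands are separate and the transform is invertible, each band can be rescaled independently before reconstruction, which is the kind of per-band adjustment a frequency-balancing module would perform.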

Authors:Dawei Zhou, Suzhi Gang, Decheng Liu, Tongliang Liu, Nannan Wang, Xinbo Gao
Title: A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation
Abstract:
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. To alleviate these issues, adversarial noise-based defenses have been enthusiastically studied in recent years. However, "data-only" methods tend to distort fake samples in the low-level feature space rather than the high-level semantic space, leading to limitations in resisting malicious manipulation. Frontier research has shown that integrating knowledge in deep learning can produce reliable and generalizable solutions. Inspired by this, we propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples. Specifically, in the process of generating adversarial noise, we focus on constructing significant semantic confusions at the domain-specific knowledge level, and exploit a metric closely related to visual perception to replace the general pixel-wise metrics. The generated adversarial noise can actively interfere with the malicious manipulation model by triggering knowledge-guided and perception-related disruptions in the fake samples. To validate the effectiveness of the proposed method, we conduct qualitative and quantitative experiments on human perception and visual quality assessment. The results on two different tasks both show that our defense provides better protection compared to state-of-the-art methods and achieves great generalizability.
中文: 提出的知识引导对抗防御(KGAD)通过引入语义混淆和基于感知的度量,主动干扰恶意篡改模型,在多项任务中展现出优于现有方法的保护效果和泛化能力。
English: The proposed knowledge-guided adversarial defense (KGAD) actively disrupts malicious manipulation models by introducing semantic confusions and perception-based metrics, outperforming existing methods in protection and generalizability across tasks.

Authors:Yiyan Xu, Wuqiang Zheng, Wenjie Wang, Fengbin Zhu, Xinting Hu, Yang Zhang, Fuli Feng, Tat-Seng Chua
Title: DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
Abstract:
Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods, whether based on diffusion models, large language models, or Large Multimodal Models (LMMs), struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
中文: 个性化图像生成面临用户风格与语义意图难以准确融合的挑战,导致引导崩溃,而提出的DRC框架通过解耦表示组合有效解决了这一问题,提升了生成的可控性和效果。
English: Personalized image generation faces challenges in accurately capturing user style and semantic intentions, leading to guidance collapse, which the proposed DRC framework addresses through disentangled representation composition to enhance control and effectiveness.

Authors:Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng
Title: ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Abstract:
Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) A masked autoencoder (MAE) loss, (2) Vector quantization based on semantic priors, and (3) An autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
中文: ALMTokenizer是一种新型低码率音频编解码器,通过基于查询的压缩策略和增强的语义建模技术,在音频理解与生成任务中超越了现有方法。
English: ALMTokenizer is a novel low-bitrate audio codec that uses a query-based compression strategy and enhanced semantic modeling techniques to outperform previous methods in audio understanding and generation tasks.

Authors:Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li
Title: $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
Abstract:
Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by relying primarily on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results not only in suboptimal performance but also in inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
中文: 提出的“掩码与恢复”策略通过整合全局上下文关联和细粒度置信度评分,解决了现有方法中质量不一致和上下文推理利用不足的问题,从而提升了音视频说话人提取的效果。
English: The proposed Mask-And-Recover strategy enhances audio-visual speaker extraction by incorporating global contextual correlations and a fine-grained confidence score to address inconsistent quality and underutilized contextual inference in existing methods.

Authors:Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
Title: Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
Abstract:
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to poor generalizability. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (ICD) framework. To eliminate biases within the instruction-tuning dataset, it is essential to ensure that these biases do not provide any additional information to predict the answers, i.e., the information gain of these biases for predicting the answers needs to be 0. Under this guidance, the framework utilizes a causal intervention-based data rewriting method to automatically and autonomously balance the distribution of the instruction-tuning dataset, reducing the information gain. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that ICD can effectively debias LLMs to improve their generalizability across different tasks.
中文: 信息增益引导的因果干预去偏(ICD)框架结合因果机制与信息论,通过消除偏见对答案预测的信息增益来自动平衡指令微调数据集,进而在去偏数据上采用监督微调有效提升大语言模型的泛化能力。
English: The proposed Information Gain-guided Causal Intervention Debiasing (ICD) framework combines causal mechanisms with information theory to automatically balance instruction-tuning datasets by eliminating biases' predictive information, thereby enhancing LLMs' generalizability through supervised fine-tuning on debiased data.
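
The zero-information-gain criterion stated in this abstract can be made concrete with a small worked example. The sketch below computes the standard information gain H(answer) - H(answer | bias) for discrete values on toy data. The feature values and answers are made up for illustration, and the rebalanced dataset is written by hand; the paper's causal intervention-based rewriting step is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(bias_values, answers):
    """H(answer) - H(answer | bias) for paired discrete observations."""
    total = len(answers)
    conditional = 0.0
    for value in set(bias_values):
        subset = [a for b, a in zip(bias_values, answers) if b == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(answers) - conditional


# Toy data (hypothetical): a surface feature that fully predicts the answer.
biased_feature = ["neg", "neg", "neg", "pos", "pos", "pos"]
biased_answers = ["no", "no", "no", "yes", "yes", "yes"]

# Hand-balanced data: each feature value co-occurs with both answers equally.
balanced_feature = ["neg", "neg", "pos", "pos"]
balanced_answers = ["no", "yes", "no", "yes"]

print(information_gain(biased_feature, biased_answers))      # 1.0 bit
print(information_gain(balanced_feature, balanced_answers))  # 0.0 bits
```

In the biased set, knowing the feature removes all uncertainty about the answer, so the gain equals the full one bit of answer entropy. Once the feature is balanced across answers, it carries no predictive information and the gain drops to zero, which is the target condition the abstract describes.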

Authors:Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei
Title: 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation
Abstract:
3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL uses a teacher-student paradigm in which the teacher generates high-confidence-filtered pseudo-labels to guide the student. However, in the context of 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model's learning potential. The reliance on high-confidence thresholds for filtering often results in potentially valuable pseudo-labels being discarded, restricting the model's ability to leverage the abundant unlabeled data. Therefore, we identify two critical challenges in semi-supervised 3D-RES, namely, inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs called Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS aids in the selection of high-quality pseudo-labels, integrating them into the labeled dataset to strengthen the labeled supervision signals. QDW preserves low-quality pseudo-labels by dynamically assigning them lower weights, allowing useful information to be extracted from them rather than discarded. Extensive experiments conducted on the widely used benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points compared to the fully supervised method.
中文: 3DResT框架通过师生一致性采样强化高质量伪标签的利用,并采用质量驱动动态加权机制挖掘低质量伪标签的有效信息,在仅使用1%标注数据时比全监督方法提升8.34个mIoU值。
English: The proposed 3DResT framework addresses inefficient pseudo-label utilization in semi-supervised 3D-RES through Teacher-Student Consistency-Based Sampling to enhance high-quality labels and Quality-Driven Dynamic Weighting to extract value from low-quality labels, achieving an 8.34 mIoU improvement with only 1% labeled data.

Authors:Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Title: AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Abstract:
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
中文: AdaSteer提出了一种自适应激活引导方法,通过基于拒绝法则和危害性法则的动态系数调整模型行为,在多种大语言模型上实现了更优的越狱防御效果,同时保持良性输入的处理能力。
English: AdaSteer introduces an adaptive activation steering method that dynamically adjusts model behavior using input-specific coefficients derived from Rejection Law and Harmfulness Law, achieving superior jailbreak defense while maintaining utility across multiple LLMs.
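
The idea of an input-dependent steering coefficient can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's formulation: it steers along a single direction, and the mapping from projection to coefficient (a scaled sigmoid with hand-picked `weight`, `bias`, and `max_strength`) is hypothetical. The abstract only says that the coefficients are learned via logistic regression and that inputs opposing the rejection direction need stronger steering.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_steer(hidden, direction, weight, bias, max_strength):
    """Shift `hidden` along `direction` with an input-dependent coefficient."""
    projection = dot(hidden, direction)
    # Assumed form: a logistic function of the projection, scaled to max_strength.
    coefficient = max_strength * sigmoid(weight * projection + bias)
    steered = [h + coefficient * d for h, d in zip(hidden, direction)]
    return steered, coefficient


# Hypothetical 2-D activations and a unit "rejection direction".
rejection_direction = [1.0, 0.0]
opposed_input = [-2.0, 0.5]  # projects negatively onto the direction
aligned_input = [2.0, 0.5]   # already points along the direction

# A negative weight makes the coefficient grow as the projection decreases.
steered_opposed, coef_opposed = adaptive_steer(opposed_input, rejection_direction, -1.0, 0.0, 4.0)
steered_aligned, coef_aligned = adaptive_steer(aligned_input, rejection_direction, -1.0, 0.0, 4.0)
print(coef_opposed > coef_aligned)  # True
```

With a fixed coefficient, both inputs would be shifted by the same amount. Here the input that opposes the direction receives a larger push while the aligned input is barely changed. A second direction, such as the Harmfulness Direction named in the abstract, could be handled the same way with its own coefficient.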

Authors:Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu
Title: WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
Abstract:
With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
中文: 本研究为网页代理引入了显式回滚机制,使其能够恢复到之前的导航状态,从而在复杂网页环境中提升规划和搜索效率,实验在实时基准测试中验证了其有效性。
English: This study introduces an explicit rollback mechanism for web agents, allowing them to revert to previous navigation states, which enhances planning and search efficiency in complex web environments, as validated by experiments on live benchmarks.

Authors:Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
Abstract:
Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advancing reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization beyond math such as biology, physics and chemistry, underscoring its broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
中文: DeepMath-103K数据集通过提供高难度、严格去污的大规模数学资源,解决了强化学习中缺乏挑战性和可验证训练数据的问题,使模型在数学领域达到顶尖水平并能够泛化到其他科学领域。
English: The DeepMath-103K dataset addresses the scarcity of challenging and verifiable training data in reinforcement learning by providing a large-scale mathematical resource with high difficulty and rigorous decontamination, enabling models to achieve state-of-the-art results in math and generalize to other scientific domains.

Authors:Weixuan Chen, Qianqian Yang, Shuo Shao, Zhiguo Shi, Jiming Chen, Xuemin Shen
Title: Can Knowledge Improve Security? A Coding-Enhanced Jamming Approach for Semantic Communication
Abstract:
As semantic communication (SemCom) attracts growing attention as a novel communication paradigm, ensuring the security of transmitted semantic information over open wireless channels has become a critical issue. However, traditional encryption methods often introduce significant additional communication overhead to maintain stability, and conventional learning-based secure SemCom methods typically rely on a channel capacity advantage for the legitimate receiver, which is challenging to guarantee in real-world scenarios. In this paper, we propose a coding-enhanced jamming method that eliminates the need to transmit a secret key by utilizing shared knowledge (potentially part of the training set of the SemCom system) between the legitimate receiver and the transmitter. Specifically, we leverage the shared private knowledge base to generate a set of private digital codebooks in advance using neural network (NN)-based encoders. For each transmission, we encode the transmitted data into digital sequence Y1 and associate Y1 with a sequence randomly picked from the private codebook, denoted as Y2, through superposition coding. Here, Y1 serves as the outer code and Y2 as the inner code. By optimizing the power allocation between the inner and outer codes, the legitimate receiver can reconstruct the transmitted data using successive decoding with the index of Y2 shared, while the eavesdropper's decoding performance is severely degraded, potentially to the point of random guessing. Experimental results demonstrate that our method achieves comparable security to state-of-the-art approaches while significantly improving the reconstruction performance of the legitimate receiver by more than 1 dB across varying channel signal-to-noise ratios (SNRs) and compression ratios.
中文: 本文提出了一种编码增强的干扰方法,通过利用共享私有知识生成数字码本,使合法接收方能高效解码数据,同时无需传输密钥即可严重削弱窃听者的解码能力。
English: This paper introduces a coding-enhanced jamming method for secure semantic communication that leverages shared private knowledge to generate digital codebooks, enabling legitimate receivers to decode data efficiently while degrading eavesdropper performance without transmitting secret keys.

Authors:Luankang Zhang, Kenan Song, Yi Quan Lee, Wei Guo, Hao Wang, Yawen Li, Huifeng Guo, Yong Liu, Defu Lian, Enhong Chen
Title: Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model
Abstract:
In recommendation systems, the traditional multi-stage paradigm, which includes retrieval and ranking, often suffers from information loss between stages and diminished performance. Recent advances in generative models, inspired by natural language processing, suggest the potential for unifying these stages to mitigate such loss. This paper presents the Unified Generative Recommendation Framework (UniGRF), a novel approach that integrates retrieval and ranking into a single generative model. By treating both stages as sequence generation tasks, UniGRF enables sufficient information sharing without additional computational costs, while remaining model-agnostic. To enhance inter-stage collaboration, UniGRF introduces a ranking-driven enhancer module that leverages the precision of the ranking stage to refine retrieval processes, creating an enhancement loop. In addition, a gradient-guided adaptive weighter is incorporated to dynamically balance the optimization of retrieval and ranking, ensuring synchronized performance improvements. Extensive experiments demonstrate that UniGRF significantly outperforms existing models on benchmark datasets, confirming its effectiveness in facilitating information transfer. Ablation studies and further experiments reveal that UniGRF not only promotes efficient collaboration between stages but also achieves synchronized optimization. UniGRF provides an effective, scalable, and compatible framework for generative recommendation systems.
Chinese Summary: 本文提出统一生成推荐框架(UniGRF),通过将检索和排序整合为单一生成模型,实现阶段间充分信息共享与协同优化,在基准数据集上显著超越现有模型性能。
English Summary: This paper introduces the Unified Generative Recommendation Framework (UniGRF), which integrates retrieval and ranking into a single generative model to enhance information sharing and collaborative optimization between stages, achieving superior performance on benchmark datasets.

Authors:Chen Xu, Jujia Zhao, Wenjie Wang, Liang Pang, Jun Xu, Tat-Seng Chua, Maarten de Rijke
Title: Understanding Accuracy-Fairness Trade-offs in Re-ranking through Elasticity in Economics
Abstract:
Fairness is an increasingly important factor in re-ranking tasks. Prior work has identified a trade-off between ranking accuracy and item fairness. However, the underlying mechanisms are still not fully understood. An analogy can be drawn between re-ranking and the dynamics of economic transactions. The accuracy-fairness trade-off parallels the coupling of the commodity tax transfer process. Fairness considerations in re-ranking, similar to a commodity tax on suppliers, ultimately translate into a cost passed on to consumers. Analogously, item-side fairness constraints result in a decline in user-side accuracy. In economics, the extent to which commodity tax on the supplier (item fairness) transfers to commodity tax on users (accuracy loss) is formalized using the notion of elasticity. The re-ranking fairness-accuracy trade-off is similarly governed by the elasticity of utility between item groups. This insight underscores the limitations of current fair re-ranking evaluations, which often rely solely on a single fairness metric, hindering comprehensive assessment of fair re-ranking algorithms. Centered around the concept of elasticity, this work presents two significant contributions. We introduce the Elastic Fairness Curve (EF-Curve) as an evaluation framework. This framework enables a comparative analysis of algorithm performance across different elasticity levels, facilitating the selection of the most suitable approach. Furthermore, we propose ElasticRank, a fair re-ranking algorithm that employs elasticity calculations to adjust inter-item distances within a curved space. Experiments on three widely used ranking datasets demonstrate its effectiveness and efficiency.
中文: 本研究借鉴经济弹性概念,提出弹性公平曲线评估框架和ElasticRank算法,通过分析重排序系统中公平性与准确性的权衡关系并进行优化,在三个数据集上验证了其有效性。
English: This study introduces the Elastic Fairness Curve (EF-Curve) evaluation framework and ElasticRank algorithm, using economic elasticity concepts to analyze and optimize the fairness-accuracy trade-off in re-ranking systems, with experimental validation across three datasets.

Authors:Wei Wang, Nan Cheng, Conghao Zhou, Haixia Peng, Haibo Zhou, Zhou Su, Xuemin Shen
Title: An Enhanced Dual-Currency VCG Auction Mechanism for Resource Allocation in IoV: A Value of Information Perspective
Abstract:
The Internet of Vehicles (IoV) is undergoing a transformative evolution, enabled by advancements in future 6G network technologies, to support intelligent, highly reliable, and low-latency vehicular services. However, the enhanced capabilities of IoV have heightened the demands for efficient network resource allocation while simultaneously giving rise to diverse vehicular service requirements. For network service providers (NSPs), meeting the customized resource-slicing requirements of vehicle service providers (VSPs) while maximizing social welfare has become a significant challenge. This paper proposes an innovative solution by integrating a mean-field multi-agent reinforcement learning (MFMARL) framework with an enhanced Vickrey-Clarke-Groves (VCG) auction mechanism to address the problem of social welfare maximization under the condition of unknown VSP utility functions. The core of this solution is introducing the "value of information" as a novel monetary metric to estimate the expected benefits of VSPs, thereby ensuring the effective execution of the VCG auction mechanism. MFMARL is employed to optimize resource allocation for social welfare maximization while adapting to the intelligent and dynamic requirements of IoV. The proposed enhanced VCG auction mechanism not only protects the privacy of VSPs but also reduces the likelihood of collusion among VSPs, and it is theoretically proven to be dominant-strategy incentive compatible (DSIC). The simulation results demonstrate that, compared to the VCG mechanism implemented using quantization methods, the proposed mechanism exhibits significant advantages in convergence speed, social welfare maximization, and resistance to collusion, providing new insights into resource allocation in intelligent 6G networks.
中文摘要:本文提出一种结合均值场多智能体强化学习的改进VCG拍卖机制,通过引入"信息价值"作为新型货币度量,在保护车辆服务提供商隐私的同时实现6G车联网资源分配的社会福利最大化,并有效防止合谋行为。
English Summary: This paper introduces an enhanced VCG auction mechanism integrated with mean-field multi-agent reinforcement learning to maximize social welfare in 6G-enabled Internet of Vehicles by efficiently allocating network resources while protecting vehicle service providers' privacy and preventing collusion.
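
For readers unfamiliar with the baseline that the paper builds on, the following is a minimal sketch of a textbook VCG auction, not the enhanced dual-currency mechanism itself. It assumes identical resource units and bidders that each want at most one unit; the function name `vcg_unit_demand` and the bidder labels are hypothetical.

```python
def vcg_unit_demand(bids: dict, units: int):
    """Textbook VCG for `units` identical items and unit-demand bidders.

    Each winner pays the externality it imposes on the others:
    (best welfare of the others if the winner were absent)
    minus (welfare the others actually obtain).
    """
    def top(bidders):
        # Highest-value bidders receive the available units.
        return sorted(bidders, key=lambda b: bids[b], reverse=True)[:units]

    winners = top(bids)
    payments = {}
    for w in winners:
        others = [b for b in bids if b != w]
        welfare_without_w = sum(bids[b] for b in top(others))
        welfare_of_others = sum(bids[b] for b in winners if b != w)
        payments[w] = welfare_without_w - welfare_of_others
    return winners, payments


# Illustrative reported values for four hypothetical VSPs competing for two slices.
bids = {"vsp_a": 10, "vsp_b": 7, "vsp_c": 4, "vsp_d": 2}
winners, payments = vcg_unit_demand(bids, units=2)
```

In this setting each winner's payment equals the highest losing bid, which is why truthful bidding is a dominant strategy: a winner's own report never affects what it pays.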

Authors:Xiucheng Wang, Zhongsheng Fang, Nan Cheng, Ruijin Sun, Zan Li, Xuemin Shen
Title: RadioDiff-Inverse: Diffusion Enhanced Bayesian Inverse Estimation for ISAC Radio Map Construction
Abstract:
Radio maps (RMs) are essential for environment-aware communication and sensing, providing location-specific wireless channel information. Existing RM construction methods often rely on precise environmental data and base station (BS) locations, which are not always available in dynamic or privacy-sensitive environments. While sparse measurement techniques reduce data collection, the impact of noise in sparse data on RM accuracy is not well understood. This paper addresses these challenges by formulating RM construction as a Bayesian inverse problem under coarse environmental knowledge and noisy sparse measurements. Although maximum a posteriori (MAP) filtering offers an optimal solution, it requires a precise prior distribution of the RM, which is typically unavailable. To solve this, we propose RadioDiff-Inverse, a diffusion-enhanced Bayesian inverse estimation framework that uses an unconditional generative diffusion model to learn the RM prior. This approach not only reconstructs the spatial distribution of wireless channel features but also enables perception of environmental structure, such as building outlines and the BS location, relying only on pathloss, through integrated sensing and communication (ISAC). Remarkably, RadioDiff-Inverse is training-free, leveraging a model pre-trained on ImageNet without task-specific fine-tuning, which significantly reduces the training cost of using large generative models in wireless networks. Experimental results demonstrate that RadioDiff-Inverse achieves state-of-the-art performance in the accuracy of RM construction and environmental reconstruction, and in robustness against noisy sparse sampling.
中文摘要:本文提出RadioDiff-Inverse这一无需训练的贝叶斯框架,利用扩散模型仅通过稀疏噪声测量即可构建无线电地图,无需精确环境数据,在精度和环境重建方面表现优异。
English Summary: This paper introduces RadioDiff-Inverse, a training-free Bayesian framework that leverages diffusion models to construct radio maps from noisy sparse measurements without requiring precise environmental data, achieving superior accuracy and environmental reconstruction.

Authors:Xiaoyan Zhao, Yang Deng, Wenjie Wang, Hongzhan Lin, Hong Cheng, Rui Zhang, See-Kiong Ng, Tat-Seng Chua
Title: Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models
Abstract:
Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.
中文摘要:本研究提出了PerCRS,一种基于大语言模型的个性化对话推荐系统模拟框架,通过实验验证了人格特质如何影响用户交互行为并驱动推荐策略的动态调整。
English Summary: The study introduces PerCRS, an LLM-based personality-aware simulation for conversational recommender systems that demonstrates how personality traits influence user interactions and prompt dynamic adjustments in recommendation strategies.

Authors:Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
Title: FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection
Abstract:
Verifying the authenticity of identity documents (IDs) has become a critical challenge for real-life applications such as digital banking, crypto-exchanges, renting, etc. This study focuses on the topic of fake ID detection, covering several limitations in the field. In particular, there are no publicly available data from real IDs for proper research in this area, and most published studies rely on proprietary internal databases that are not available for privacy reasons. To address this critical challenge of real data scarcity, which makes it so difficult to advance the technology of machine learning-based fake ID detection, we introduce a new patch-based methodology that trades off privacy and performance, and propose a novel patch-wise approach for privacy-aware fake ID detection: FakeIDet. In our experiments, we explore: i) two levels of anonymization for an ID (i.e., fully- and pseudo-anonymized), and ii) different patch size configurations, varying the amount of sensitive data visible in the patch image. State-of-the-art methods, such as vision transformers and foundation models, are considered as backbones. Our results show that, on an unseen database (DLC-2021), our proposal for fake ID detection achieves 13.91% and 0% EERs at the patch and the whole ID level, showing a good generalization to other databases. In addition to the patch-based methodology introduced and the new FakeIDet method based on it, another key contribution of our article is the release of the first publicly available database that contains 48,400 patches from real and fake IDs, called FakeIDet-db, together with the experimental framework.
中文摘要:本研究针对假身份证检测领域的数据稀缺问题,提出了一种平衡隐私与性能的基于图像块的方法FakeIDet,并发布了首个包含48,400个真假身份证图像块的公开数据库FakeIDet-db以推动该领域研究。
English Summary: This study addresses the challenge of fake ID detection by introducing a patch-based method called FakeIDet that balances privacy and performance, along with releasing the first public database of 48,400 real and fake ID patches to overcome data scarcity in the field.

Authors:Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu
Title: A Survey of Interactive Generative Video
Abstract:
Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.
中文: 交互式生成视频(IGV)是一种结合生成与交互功能的技术,应用于游戏、具身人工智能和自动驾驶领域,提出了包含五个核心模块的框架以应对技术挑战并推动未来发展。
English: Interactive Generative Video (IGV) is a technology that creates high-quality, interactive video content for applications in gaming, embodied AI, and autonomous driving, with a proposed framework of five key modules to address current challenges and guide future advancements.

Authors:He Zhu, Quyu Kong, Kechun Xu, Xunlong Xia, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang
Title: Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
Abstract:
Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. So this dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at https://sites.google.com/view/lmaffordance3d.
中文摘要:本文提出了一种基于语言指令和多模态数据的3D物体可供性定位新任务,通过LMAffordance3D网络实现了优越性能,即使在未见过的实验场景中仍表现优异。
English Summary: This paper introduces a novel task for grounding 3D object affordance using language instructions and multi-modal data, proposing the LMAffordance3D network that demonstrates superior performance even in unseen scenarios.

Authors:Yunxuan Mao, Rong Xiong, Yue Wang, Yiyi Liao
Title: UnIRe: Unsupervised Instance Decomposition for Dynamic Urban Scene Reconstruction
Abstract:
Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene editing. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object templates. Furthermore, we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal stability. Experiments on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.
中文: UnIRe是一种基于3D高斯喷洒和4D超点的新方法,能自动将城市场景分解为静态背景和动态实例而无需人工标注,实现了卓越的重建效果并支持灵活的实例级编辑。
English: UnIRe is a novel approach using 3D Gaussian Splatting and 4D superpoints to automatically decompose urban scenes into static backgrounds and dynamic instances without manual annotations, achieving superior reconstruction and enabling flexible instance-level editing.

Authors:Yu Cui, Yujun Cai, Yiwei Wang
Title: Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression
Abstract:
While reasoning large language models (LLMs) demonstrate remarkable performance across various tasks, they also contain notable security vulnerabilities. Recent research has uncovered a "thinking-stopped" vulnerability in DeepSeek-R1, where model-generated reasoning tokens can forcibly interrupt the inference process, resulting in empty responses that compromise LLM-integrated applications. However, existing methods triggering this vulnerability require complex mathematical word problems with long prompts--even exceeding 5,000 tokens. To reduce the token cost and formally define this vulnerability, we propose a novel prompt injection attack named "Reasoning Interruption Attack", based on adaptive token compression. We demonstrate that simple standalone arithmetic tasks can effectively trigger this vulnerability, and the prompts based on such tasks exhibit simpler logical structures than mathematical word problems. We develop a systematic approach to efficiently collect attack prompts and an adaptive token compression framework that utilizes LLMs to automatically compress these prompts. Experiments show our compression framework significantly reduces prompt length while maintaining effective attack capabilities. We further investigate the attack's performance via output prefix and analyze the underlying causes of the vulnerability, providing valuable insights for improving security in reasoning LLMs.
中文: 本研究提出基于自适应令牌压缩的“推理中断攻击”,能有效触发DeepSeek-R1等推理大语言模型的思维停止漏洞,在显著缩短提示长度的同时保持攻击效果,并为提升模型安全性提供了重要洞见。
English: The study introduces a "Reasoning Interruption Attack" using adaptive token compression to efficiently trigger a thinking-stopped vulnerability in reasoning LLMs like DeepSeek-R1, reducing prompt length while maintaining attack effectiveness and offering insights for security improvements.

Authors:Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, Weijie Chen, Baobei Xu, Fengyu Sun, Di Xie, Jiang Zhu, Mykola Lavreniuk, Haining Guan, Qun Wu, Yupei Zeng, Chao Lu, Huanran Wang, Guangyuan Zhou, Haotian Zhang, Jianxiong Wang, Qiang Rao, Chunjie Wang, Xiao Liu, Zhiqiang Lou, Hualie Jiang, Yihao Chen, Rui Xu, Minglang Tan, Zihan Qin, Yifan Mao, Jiayang Liu, Jialei Xu, Yifan Yang, Wenbo Zhao, Junjun Jiang, Xianming Liu, Mingshuai Zhao, Anlong Ming, Wu Chen, Feng Xue, Mengying Yu, Shida Gao, Xiangfeng Wang, Gbenga Omotara, Ramy Farag, Jacket Demby, Seyed Mohamad Ali Tousi, Guilherme N DeSouza, Tuan-Anh Yang, Minh-Quang Nguyen, Thien-Phuc Tran, Albert Luginov, Muhammad Shahzad
Title: The Fourth Monocular Depth Estimation Challenge
Abstract:
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
Chinese: 第四届单目深度估计挑战赛优化了评估标准和基线方法,24项提交结果均超越基准,将3D F-分数从22.58%提升至23.05%,其中领先方案主要采用仿射不变预测技术。
English: The fourth Monocular Depth Estimation Challenge enhanced evaluation protocols and baselines, leading to 24 submissions that surpassed benchmarks and improved the 3D F-Score from 22.58% to 23.05%, with top methods utilizing affine-invariant predictions.
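
The two-degree-of-freedom least-squares alignment mentioned above can be sketched as fitting a scale and a shift that map a prediction onto the ground truth before computing metrics. This is a generic closed-form fit under stated assumptions, not the challenge's evaluation code: the function name `align_scale_shift` and the toy values are hypothetical, and whether the challenge aligns in depth or disparity space is not specified here.

```python
def align_scale_shift(pred, target):
    """Return (scale, shift) minimizing sum((scale * p + shift - t) ** 2)."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


# Toy example: the prediction is correct up to an unknown affine transform.
pred = [1.0, 2.0, 3.0, 4.0]
target = [2.0 * p + 0.5 for p in pred]
scale, shift = align_scale_shift(pred, target)
aligned = [scale * p + shift for p in pred]
```

Because both parameters are fitted per image, a method that predicts only relative (affine-invariant) depth is not penalized for an arbitrary global scale or offset.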

Authors:Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, Deqing Yang
Title: BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation
Abstract:
Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents with newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld's design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: https://bookworld2025.github.io/.
中文: 本文提出BookWorld系统,通过构建基于书籍的多智能体社会来模拟复杂虚构世界,在保持原著忠实度的同时实现故事生成等应用,并以75.36%的胜率超越现有方法。
English: This paper introduces BookWorld, a system for simulating book-based multi-agent societies that captures intricate fictional elements and enables applications like story generation, outperforming prior methods with a 75.36% win rate while maintaining fidelity to source materials.

Authors:Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Title: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
中文: 当前RLVR方法未能激发大语言模型全新的推理能力,其表现仍受限于基础模型,凸显了改进强化学习范式的必要性。
English: Current RLVR methods fail to elicit fundamentally new reasoning abilities in LLMs, as their performance remains bounded by the base model's capabilities, highlighting the need for improved RL paradigms.
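
Since the analysis above hinges on pass@k at large k, a short sketch of how pass@k is commonly estimated may help. It uses the widely used unbiased estimator computed from n samples of which c are correct; that the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved by only 1 of 4 samples looks weak at k = 1
# but counts as fully covered once k reaches 4.
low_k = pass_at_k(n=4, c=1, k=1)
high_k = pass_at_k(n=4, c=1, k=4)
```

This is why small-k and large-k scores can rank two models differently: small k rewards putting probability mass on correct answers, while large k rewards covering at least one correct answer at all.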

Authors:Shiguang Wu, Zhaochun Ren, Xin Xin, Jiyuan Yang, Mengqi Zhang, Zhumin Chen, Maarten de Rijke, Pengjie Ren
Title: Constrained Auto-Regressive Decoding Constrains Generative Retrieval
Abstract:
Generative retrieval seeks to replace traditional search index data structures with a single large-scale neural network, offering the potential for improved efficiency and seamless integration with generative large language models. As an end-to-end paradigm, generative retrieval adopts a learned differentiable search index to conduct retrieval by directly generating document identifiers through corpus-specific constrained decoding. The generalization capabilities of generative retrieval on out-of-distribution corpora have garnered significant attention. In this paper, we examine the inherent limitations of constrained auto-regressive generation from two essential perspectives: constraints and beam search. We begin with the Bayes-optimal setting where the generative retrieval model exactly captures the underlying relevance distribution of all possible documents. Then we apply the model to specific corpora by simply adding corpus-specific constraints. Our main findings are two-fold: (i) For the effect of constraints, we derive a lower bound of the error, in terms of the KL divergence between the ground-truth and the model-predicted step-wise marginal distributions. (ii) For the beam search algorithm used during generation, we reveal that the usage of marginal distributions may not be an ideal approach. This paper aims to improve our theoretical understanding of the generalization capabilities of the auto-regressive decoding retrieval paradigm, laying a foundation for understanding its limitations and inspiring future advancements toward more robust and generalizable generative retrieval.
中文摘要:本文通过分析约束条件和束搜索,探讨了生成式检索在泛化到新数据集时的理论局限性,揭示了其固有误差和次优解码策略。
English Summary: This paper analyzes the theoretical limitations of generative retrieval in generalizing to new datasets by examining constraints and beam search, revealing inherent errors and suboptimal decoding strategies.
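
The corpus-specific constrained decoding and beam search discussed in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes document identifiers are tuples of tokens, uses a hypothetical stand-in scorer (`toy_log_probs`) in place of a real generative retrieval model, and applies the constraint by never expanding tokens that fall outside a prefix trie built from the corpus (no renormalization).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Token = str
Prefix = Tuple[Token, ...]


def build_trie(doc_ids: Sequence[Sequence[Token]]) -> Dict[Prefix, List[Token]]:
    """Map every valid identifier prefix to the tokens that may follow it."""
    allowed: Dict[Prefix, set] = {}
    for doc_id in doc_ids:
        for i, tok in enumerate(doc_id):
            allowed.setdefault(tuple(doc_id[:i]), set()).add(tok)
    return {prefix: sorted(toks) for prefix, toks in allowed.items()}


def constrained_beam_search(
    log_probs: Callable[[Prefix], Dict[Token, float]],
    trie: Dict[Prefix, List[Token]],
    length: int,
    beam_size: int,
) -> List[Tuple[Prefix, float]]:
    """Beam search that only expands tokens permitted by the trie."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            step = log_probs(prefix)
            for tok in trie.get(prefix, []):
                # Tokens the model does not score get probability zero.
                candidates.append((prefix + (tok,), score + step.get(tok, -math.inf)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_log_probs(prefix: Prefix) -> Dict[Token, float]:
    """Hypothetical step-wise scorer; the numbers are made up for illustration."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.55), "y": math.log(0.45)}
    return {"x": math.log(0.9), "y": math.log(0.1)}


# Toy corpus of three two-token document identifiers.
corpus = [("a", "x"), ("a", "y"), ("b", "x")]
trie = build_trie(corpus)
top1 = constrained_beam_search(toy_log_probs, trie, length=2, beam_size=1)
top3 = constrained_beam_search(toy_log_probs, trie, length=2, beam_size=3)
```

In this toy setup, a beam of size 1 commits to the prefix `a` (probability 0.6) and returns `("a", "x")` with probability 0.33, while the identifier `("b", "x")` has the higher sequence probability 0.36 and is found only with a wider beam. This merely illustrates how step-wise pruning can miss the best identifier; it does not reproduce the paper's theoretical analysis.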

Authors:Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Title: SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Abstract:
Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
中文总结:现有大语言模型安全对齐存在对新型攻击防御不足和过度拒绝良性指令的问题,为此提出的SaRO框架通过两阶段推理优化实现了更有效的安全防护。
English Summary: Current LLM safety alignment struggles with under-generalization against new attacks and over-alignment causing excessive refusals, prompting the proposed SaRO framework that enhances safety through two-stage reasoning optimization.
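
The SRPO stage described in this abstract relies on direct preference optimization (DPO). As a reminder of what that objective computes, the following is a minimal sketch of the standard per-pair DPO loss, not SaRO's training code. It assumes the sequence log-probabilities of the preferred and rejected responses under the policy and under a frozen reference model are already available; the function name and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Scaled difference of the implicit rewards (log-ratio to the reference model).
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy favors the chosen response more than the reference does, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In practice the log-probabilities would come from summing token-level log-probabilities over each response, and the loss would be averaged over a batch of preference pairs.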

Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Title: DSwinIR: Rethinking Window-based Attention for Image Restoration
Abstract:
Image restoration has witnessed significant advancements with the development of deep learning models. In particular, Transformer-based models leveraging window-based self-attention have become a dominant force in image restoration. However, their performance is fundamentally constrained by the rigid, non-overlapping window partitioning scheme, which leads to two critical limitations: insufficient feature interaction across window boundaries and content-agnostic receptive fields that cannot adapt to diverse image structures. Existing methods often rely on heuristic patterns to mitigate these issues, rather than addressing the root cause. In this paper, we propose the Deformable Sliding Window Transformer (DSwinIR), a new foundational backbone architecture that systematically overcomes these limitations. At the heart of DSwinIR is the proposed Deformable Sliding Window (DSwin) Attention. This mechanism introduces two fundamental innovations. First, it replaces the rigid partitioning with a token-centric sliding window paradigm, ensuring seamless cross-window information flow and effectively eliminating boundary artifacts. Second, it incorporates a content-aware deformable sampling strategy, which allows the attention mechanism to learn data-dependent offsets and dynamically shape its receptive fields to focus on the most informative image regions. This synthesis endows the model with both strong locality-aware inductive biases and powerful, adaptive long-range modeling capabilities. Extensive experiments show that DSwinIR sets a new state-of-the-art across a wide spectrum of image restoration tasks. For instance, in all-in-one restoration, our DSwinIR surpasses the most recent backbone GridFormer by over 0.53 dB on the three-task benchmark and a remarkable 0.86 dB on the five-task benchmark.
Chinese: 可变形滑动窗口Transformer(DSwinIR)通过创新的注意力机制解决了传统窗口划分的局限性,实现了跨窗口无缝交互和内容感知感受野,在多种图像复原任务中取得了最先进的性能。
English: The Deformable Sliding Window Transformer (DSwinIR) introduces a novel attention mechanism that overcomes limitations of rigid window partitioning by enabling seamless cross-window interaction and content-aware receptive fields, achieving state-of-the-art performance across various image restoration tasks.

Authors:Sifan Li, Yujun Cai, Bryan Hooi, Nanyun Peng, Yiwei Wang
Title: Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs
Abstract:
Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic analysis reveals consistent failure patterns: models often interpret drug names literally, overuse common herbs regardless of relevance, and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs also fail to understand the verification task. These findings demonstrate that current LLMs rely primarily on drug names rather than possessing systematic pharmacological knowledge. To address these limitations, we propose a Retrieval Augmented Generation (RAG) approach focused on ingredient names. Experiments across 220 TCM formulations show our method significantly improves accuracy from approximately 50% to 82% in ingredient verification tasks. Our work highlights critical weaknesses in current TCM-specific LLMs and offers a practical solution for enhancing their clinical reliability.
中文: 研究发现当前中医大语言模型常因字面解读和不稳定响应而误判药物成分,但采用检索增强生成方法后,验证准确率从50%显著提升至82%。
English: This study reveals that current Large Language Models for Traditional Chinese Medicine frequently misinterpret drug ingredients due to literal interpretations and inconsistent responses, but implementing a Retrieval Augmented Generation approach significantly boosts verification accuracy from 50% to 82%.

Authors:Zhaochen Wang, Bryan Hooi, Yiwei Wang, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Title: Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models
Abstract:
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals across modalities remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts. We introduce a novel evaluation framework that systematically challenges five state-of-the-art models (including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where character-level semantics deliberately contradict global visual patterns. Our experiments reveal a strong text-priority bias: VLMs consistently prioritize textual information over visual patterns, with visual recognition ability declining dramatically as semantic complexity increases. Various mitigation attempts through visual parameter tuning and prompt engineering yielded only modest improvements, suggesting that this limitation requires architectural-level solutions. These findings uncover fundamental flaws in how current VLMs integrate multimodal information, providing important guidance for future model development while highlighting significant implications for content moderation systems vulnerable to adversarial examples.
中文摘要:视觉语言模型在处理ASCII艺术时表现出强烈的文本优先偏向,尽管尝试了多种缓解方法,模型仍持续优先处理文本语义而非视觉模式,这揭示了多模态整合的根本缺陷,需要架构层面的解决方案。
English Summary: Vision-language models exhibit a strong text-priority bias when processing ASCII art, consistently favoring textual semantics over visual patterns despite various mitigation attempts, revealing fundamental flaws in multimodal integration that require architectural solutions.

Authors:Chunxue Xu, Yiwei Wang, Bryan Hooi, Yujun Cai, Songze Li
Title: How does Watermarking Affect Visual Language Models in Document Understanding?
Abstract:
Visual Language Models (VLMs) have become foundational models for document understanding tasks, widely used in the processing of complex multimodal documents across domains such as finance, law, and academia. However, documents often contain noise-like information, such as watermarks, which inevitably leads us to inquire: \emph{Do watermarks degrade the performance of VLMs in document understanding?} To address this, we propose a novel evaluation framework to investigate the effect of visible watermarks on VLM performance. We take into account various factors, including different types of document data, the positions of watermarks within documents, and variations in watermark content. Our experimental results reveal that VLM performance can be significantly compromised by watermarks, with performance drop rates reaching up to 36\%. We discover that \emph{scattered} watermarks cause stronger interference than centralized ones, and that \emph{semantic contents} in watermarks create greater disruption than simple visual occlusion. Through attention mechanism analysis and embedding similarity examination, we find that the performance drops are mainly attributable to watermarks 1) forcing widespread attention redistribution and 2) altering semantic representation in the embedding space. Our research not only highlights significant challenges in deploying VLMs for document understanding, but also provides insights towards developing robust inference mechanisms on watermarked documents.
中文: 本研究提出新型评估框架,揭示可见水印会使视觉语言模型的文档理解性能下降高达36%,主要原因为水印导致注意力重分布和语义表征改变。
English: This study introduces a novel evaluation framework revealing that visible watermarks can degrade Visual Language Models' document understanding performance by up to 36%, primarily through attention redistribution and semantic representation alterations.

Authors:Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Bin Zhao, Xuelong Li
Title: AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Abstract:
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, \emph{e.g.}, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model designed for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
中文: 本文提出面向航拍图像的视觉定位新任务AerialVG,针对航拍图中目标外观相似、需强化空间推理的特点,构建了专用数据集并设计了结合层次化注意力与关系感知模块的模型,有效解决了传统方法在航拍场景中的局限。
English: This paper introduces AerialVG, a novel visual grounding task for aerial imagery that emphasizes spatial reasoning due to challenges like distinguishing visually similar objects, and proposes a specialized dataset and model with hierarchical attention and relation-aware modules to address these difficulties.

Authors:Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, Yueting Zhuang
Title: Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Abstract:
Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought~(CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT steps as training samples. Then, we train the SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.
Chinese: 近期大语言模型奖励信号的应用虽取得进展,但在多模态领域面临挑战,为此提出SVIP方法,通过自动生成代码训练多维度思维链奖励模型,利用多头注意力机制提升多模态大模型的性能,减少幻觉并增强推理能力。
English: Recent advances in reward signals for LLMs face challenges in multimodal applications, leading to the development of SVIP, an automated method that trains a step-level multi-dimensional CoT reward model using code generation and multi-head attention, which enhances MLLM performance by reducing hallucinations and improving reasoning.

Authors:Xuyang Guo, Zekai Huang, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang
Title: Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models
Abstract:
Generative models have driven significant progress in a variety of AI tasks, including text-to-video generation, where models like Video LDM and Stable Video Diffusion can produce realistic, movie-level videos from textual instructions. Despite these advances, current text-to-video models still face fundamental challenges in reliably following human commands, particularly in adhering to simple numerical constraints. In this work, we present T2VCountBench, a specialized benchmark aimed at evaluating the counting capability of SOTA text-to-video models as of 2025. Our benchmark employs rigorous human evaluations to measure the number of generated objects and covers a diverse range of generators, including both open-source and commercial models. Extensive experiments reveal that all existing models struggle with basic numerical tasks, almost always failing to generate videos with the requested number of objects even when that number is 9 or fewer. Furthermore, our comprehensive ablation studies explore how factors like video style, temporal dynamics, and multilingual inputs may influence counting performance. We also explore prompt refinement techniques and demonstrate that decomposing the task into smaller subtasks does not easily alleviate these limitations. Our findings highlight important challenges in current text-to-video generation and provide insights for future research aimed at improving adherence to basic numerical constraints.
Chinese: 当前文本到视频生成模型在执行基础计数任务方面存在明显不足,无法可靠生成符合指定物体数量的视频,T2VCountBench评估框架的系统性实验揭示了这一数值遵循能力的关键缺陷。
English: Current text-to-video models struggle with basic counting tasks, failing to reliably generate videos with specified object counts, as revealed by the T2VCountBench evaluation framework developed to assess numerical constraint adherence.

Authors:Yuanqi Yao, Siao Liu, Haoming Song, Delin Qu, Qizhi Chen, Yan Ding, Bin Zhao, Zhigang Wang, Xuelong Li, Dong Wang
Title: Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation
Abstract:
Building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in alleviating the catastrophic forgetting problem, naively applying these methods fails to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL) to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to represent shared primitives through a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion primitives shared across different skills. Second, when acquiring new skills over the lifelong span, new prompts are appended and optimized with frozen pretrained prompts, boosting the learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL's superior performance over state-of-the-art methods.
中文摘要:本文提出原始提示学习(PPL)方法,通过两阶段学习方案:先预训练获取可重用的运动感知原始提示,再通过提示优化实现旧技能向新技能的知识迁移,在终身机器人操作任务中展现出超越现有方法的优异性能。
English Summary: The paper introduces Primitive Prompt Learning (PPL), a two-stage method that learns reusable motion-aware primitives during pre-training and efficiently transfers knowledge to new skills through prompt optimization, demonstrating superior performance in lifelong robot manipulation tasks.

Authors:Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, Ziyang Ma, Xiaojiang Peng, Xie Chen, Ya Li, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao
Title: MER 2025: When Affective Computing Meets Large Language Models
Abstract:
MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER2025 centers on the theme "When Affective Computing Meets Large Language Models (LLMs)". We aim to shift the paradigm from traditional categorical frameworks reliant on predefined emotion taxonomies to LLM-driven generative methods, offering innovative solutions for more accurate and reliable emotion understanding. The challenge features four tracks: MER-SEMI focuses on fixed categorical emotion recognition enhanced by semi-supervised learning; MER-FG explores fine-grained emotions, expanding recognition from basic to nuanced emotional states; MER-DES incorporates multimodal cues (beyond emotion words) into predictions to enhance model interpretability; MER-PR investigates whether emotion prediction results can improve personality recognition performance. For the first three tracks, baseline code is available at MERTools, and datasets can be accessed via Hugging Face. For the last track, the dataset and baseline code are available on GitHub.
中文:MER2025以“情感计算与大语言模型融合”为主题,旨在通过生成式方法革新传统情感分类框架,下设四个赛道分别聚焦半监督学习、细粒度情感识别、多模态可解释性及情感预测对人格识别的优化。
English: MER2025 focuses on integrating affective computing with large language models to transition from traditional emotion classification to generative approaches, featuring four specialized tracks that address semi-supervised learning, fine-grained emotions, multimodal interpretability, and personality recognition enhancements.

Authors:Yu Hong, Xiao Cai, Pengpeng Zeng, Shuai Zhang, Jingkuan Song, Lianli Gao, Heng Tao Shen
Title: Towards Generalized and Training-Free Text-Guided Semantic Manipulation
Abstract:
Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for text-guided semantic manipulation, which has the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results by simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantic manipulation.
中文: 文本引导的语义操控利用扩散模型实现精准图像编辑,而提出的GTF方法提供了一种无需训练、通用性强的解决方案,支持跨模态的多种语义修改且无需微调。
English: Text-guided semantic manipulation enables targeted image editing using diffusion models, and the proposed GTF method offers a generalized, training-free solution that supports multiple semantic changes across modalities without fine-tuning.

Authors:Xinjie Li, Jing Zhang, Xingyu Zhou, Chao-Kai Wen, Shi Jin
Title: Joint Channel Estimation and Signal Detection for MIMO-OFDM: A Novel Data-Aided Approach with Reduced Computational Overhead
Abstract:
The acquisition of channel state information (CSI) is essential in MIMO-OFDM communication systems. Data-aided enhanced receivers, by incorporating domain knowledge, effectively mitigate performance degradation caused by imperfect CSI, particularly in dynamic wireless environments. However, existing methodologies face notable challenges: they either refine channel estimates within MIMO subsystems separately, which proves ineffective due to deviations from assumptions regarding the time-varying nature of channels, or fully exploit the time-frequency characteristics but incur significantly high computational overhead due to dimensional concatenation. To address these issues, this study introduces a novel data-aided method aimed at reducing complexity, particularly suited for fast-fading scenarios in fifth-generation (5G) and beyond networks. We derive a general form of a data-aided linear minimum mean-square error (LMMSE)-based algorithm, optimized for iterative joint channel estimation and signal detection. Additionally, we propose a computationally efficient alternative to this algorithm, which achieves comparable performance with significantly reduced complexity. Empirical evaluations reveal that our proposed algorithms outperform several state-of-the-art approaches across various MIMO-OFDM configurations, pilot sequence lengths, and in the presence of time variability. Comparative analysis with basis expansion model-based iterative receivers highlights the superiority of our algorithms in achieving an effective trade-off between accuracy and computational complexity.
中文摘要:本研究提出了一种适用于5G快速衰落场景的低复杂度数据辅助方法,通过优化的联合信道估计与信号检测算法,在多种MIMO-OFDM配置下实现了精度与计算复杂度的有效平衡。
English Summary: This study introduces a low-complexity data-aided method for MIMO-OFDM systems that effectively addresses imperfect channel state information in fast-fading 5G environments through optimized joint channel estimation and signal detection algorithms.

Authors:Jiayi Liu, Jiajia Guo, Yiming Cui, Chao-Kai Wen, Shi Jin
Title: AdapCsiNet: Environment-Adaptive CSI Feedback via Scene Graph-Aided Deep Learning
Abstract:
Accurate channel state information (CSI) is critical for realizing the full potential of multiple-antenna wireless communication systems. While deep learning (DL)-based CSI feedback methods have shown promise in reducing feedback overhead, their generalization capability across varying propagation environments remains limited due to their data-driven nature. Existing solutions based on online training improve adaptability but impose significant overhead in terms of data collection and computational resources. In this work, we propose AdapCsiNet, an environment-adaptive DL-based CSI feedback framework that eliminates the need for online training. By integrating environmental information -- represented as a scene graph -- into a hypernetwork-guided CSI reconstruction process, AdapCsiNet dynamically adapts to diverse channel conditions. A two-step training strategy is introduced to ensure baseline reconstruction performance and effective environment-aware adaptation. Simulation results demonstrate that AdapCsiNet achieves up to 46.4% improvement in CSI reconstruction accuracy and matches the performance of online learning methods without incurring additional runtime overhead.
Chinese: AdapCsiNet是一种环境自适应的深度学习CSI反馈框架,通过结合场景图和超网络,无需在线训练即可实现高达46.4%的精度提升。
English: AdapCsiNet is an environment-adaptive deep learning framework that enhances CSI feedback by integrating scene graphs and hypernetworks, achieving up to 46.4% accuracy improvement without online training.

Authors:Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
Title: DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation
Abstract:
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the \emph{static} inference paradigm, which inevitably introduces redundant computation in certain \emph{diffusion timesteps} and \emph{spatial regions}. To overcome this inefficiency, we propose \textbf{Dy}namic \textbf{Di}ffusion \textbf{T}ransformer (DyDiT), an architecture that \emph{dynamically} adjusts its computation along both \emph{timestep} and \emph{spatial} dimensions. Specifically, we introduce a \emph{Timestep-wise Dynamic Width} (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a \emph{Spatial-wise Dynamic Token} (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Second, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.
中文: 扩散变换器(DiT)因静态推理产生冗余计算而成本高昂,提出的动态扩散变换器(DyDiT)通过在时间步和空间维度动态调整计算,显著提升效率并扩展至多种生成任务。
English: The Diffusion Transformer (DiT) faces high computational costs due to redundant static computations, which the proposed DyDiT model addresses by dynamically adjusting computations across timesteps and spatial regions, significantly improving efficiency while expanding to various generation tasks.

Authors:Mengxuan Wu, Zekai Li, Zhiyuan Liang, Moyang Li, Xuanlei Zhao, Samir Khaki, Zheng Zhu, Xiaojiang Peng, Konstantinos N. Plataniotis, Kai Wang, Wangbo Zhao, Yang You
Title: Dynamic Vision Mamba
Abstract:
Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, represented by token and block redundancy. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference or introduce extra computation for inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a reduction of 35.2\% FLOPs with only a loss of accuracy of 1.7\% on Vim-S. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
中文: 基于Mamba的视觉模型存在空间冗余,我们提出的动态视觉Mamba(DyVM)方法通过定制化令牌剪枝和动态选择SSM块,在Vim-S上仅损失1.7%精度即可减少35.2%的计算量。
English: Mamba-based vision models face spatial redundancy, which our Dynamic Vision Mamba (DyVM) method addresses by customizing token pruning and dynamically selecting SSM blocks, reducing FLOPs by 35.2% with only a 1.7% accuracy drop on Vim-S.

Authors:Jiapeng Wang, Jinhao Jiang, Zhiqiang Zhang, Jun Zhou, Wayne Xin Zhao
Title: RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
Abstract:
The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the inner logic of the problem during generation and ensuring the verifiability of the solutions. To address these issues, we propose RV-Syn, a novel Rational and Verifiable mathematical Synthesis approach. RV-Syn constructs a structured mathematical operation function library based on initial seed problems and generates computational graphs as solutions by combining Python-formatted functions from this library. These graphs are then back-translated into complex problems. Based on the constructed computational graph, we achieve solution-guided, logic-aware problem generation. Furthermore, the executability of the computational graph ensures the verifiability of the solving process. Experimental results show that RV-Syn surpasses existing synthesis methods, including those involving human-generated problems, achieving more efficient data scaling. This approach provides a scalable framework for generating high-quality reasoning datasets.
中文: RV-Syn提出了一种理性可验证的数学合成方法,通过结构化函数库构建计算图来生成逻辑感知且解题过程可验证的数学问题,在数据扩展效率上超越了现有合成方法。
English: RV-Syn introduces a rational and verifiable mathematical synthesis method that constructs computational graphs from a structured function library to generate logic-aware problems with verifiable solutions, outperforming existing data synthesis approaches in scaling efficiency.

Authors:Chunjing Gan, Dan Yang, Binbin Hu, Ziqi Liu, Yue Shen, Zhiqiang Zhang, Jian Wang, Jun Zhou
Title: POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications
Abstract:
Large language models (LLMs) have become a disruptive force in the industry, introducing unprecedented capabilities in natural language processing, logical reasoning, and other areas. However, the challenges of knowledge updates and hallucination issues have limited the application of LLMs in medical scenarios, where retrieval-augmented generation (RAG) can offer significant assistance. Nevertheless, existing retrieve-then-read approaches generally digest the retrieved documents without considering the timeliness, authoritativeness, and commonality of retrieval. We argue that these approaches can be suboptimal, especially in real-world applications where information from different sources might conflict and even information from the same source might differ across time scales, so relying entirely on the retrieved documents can degrade the performance of RAG approaches. We propose PolyRAG, which carefully incorporates judges from different perspectives and finally integrates the polyviews for retrieval-augmented generation in medical applications. To address the scarcity of real-world benchmarks for evaluation, we propose PolyEVAL, a benchmark consisting of queries and documents collected from real-world medical scenarios (including medical policy, hospital & doctor inquiry, and healthcare) with multiple tags (e.g., timeliness, authoritativeness) on them. Extensive experiments and analysis on PolyEVAL have demonstrated the superiority of PolyRAG.
中文: 大语言模型在医疗应用中面临知识更新和幻觉问题,而提出的PolyRAG系统通过整合多视角判断解决了这些挑战,并借助新构建的PolyEVAL基准验证了其优越性。
English: Large language models face challenges in medical applications due to knowledge update issues and hallucinations, but the proposed PolyRAG system overcomes these by incorporating multi-perspective judgments and is validated through the newly created PolyEVAL benchmark.

Authors:Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang
Title: AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents
Abstract:
A/B testing is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on large-scale, live traffic from human participants and by the long wait for test results. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like searching, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B test with 1,000 LLM agents on Amazon.com, and compare agent behaviors with real human shopping behaviors at scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.
中文: AgentA/B是一种创新系统,利用基于大语言模型的自主代理模拟用户在网页上的交互行为,通过可扩展的自动化测试克服了传统A/B测试依赖真人流量的局限性。
English: AgentA/B is an innovative system that uses LLM agents to simulate user interactions on webpages, addressing the limitations of traditional A/B testing by enabling scalable, automated behavior simulation without relying on live human traffic.

Authors:Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Title: UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents
Abstract:
Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate their new designs. But what about evaluating and iterating the usability testing study design itself? Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and iterating their study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users and to interactively test the target website. The system also provides a Result Viewer Interface so that the UX researchers can easily review and analyze the generated qualitative (e.g., agents' post-study surveys) and quantitative data (e.g., agents' interaction logs), or even interview agents directly. Through a heuristic evaluation with 16 UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.
Chinese: UXAgent是一个创新系统,它利用大语言模型模拟代理帮助用户体验研究人员在开展真实用户研究前,通过生成模拟用户并提供交互测试与数据分析工具来评估和优化可用性测试研究设计。
English: UXAgent is a novel system that utilizes Large Language Model-simulated Agents to help UX researchers evaluate and iterate usability testing study designs by generating simulated users and providing interactive testing and data analysis tools before conducting real human studies.

Authors:Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
Title: CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization
Abstract:
As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.
中文摘要:CAReDiO框架通过利用大语言模型生成优化对话数据,有效解决现有文化数据集在代表性和独特性方面的不足,仅需少量训练样本即可实现高效的文化对齐。
English Summary: The CAReDiO framework addresses limitations in existing cultural datasets by using LLMs to generate optimized conversational data that maximizes cultural representativeness and distinctiveness, enabling effective cultural alignment with minimal training samples.

Authors:Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
Title: Z1: Efficient Test-time Scaling with Code
Abstract:
Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, helping them reduce excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with the Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level according to the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks, matching R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.
中文摘要:本文提出一种高效的测试时扩展方法,通过训练大语言模型学习代码推理轨迹来减少冗余思维标记,仅用约30%的平均思维标记即可达到与R1-Distill-Qwen-7B相当的性能,并展现出对更广泛推理任务的良好泛化能力。
English Summary: This paper introduces an efficient test-time scaling method that trains LLMs on code reasoning trajectories to reduce unnecessary thinking tokens while maintaining performance, matching R1-Distill-Qwen-7B with about 30% of its average thinking tokens and generalizing to broader reasoning tasks.

Authors:Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
Title: VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Abstract:
Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multi-linguistics, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than on Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance on general scientific questions, while open-source models demonstrate weak performance in mathematics.
中文:VideoVista-CulturalLingo作为首个融合多元文化与双语测试的视频评估基准,揭示了现有模型在中文内容理解和时间推理方面的不足,同时展现了在通用科学问题上的优势。
English: VideoVista-CulturalLingo is the first culturally and linguistically diverse video evaluation benchmark that reveals current AI models' limitations in handling Chinese-centric content and temporal reasoning while showing strength in general scientific questions.

Authors:Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, Hongsheng Li
Title: From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Abstract:
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on the state-of-the-art diffusion transformer FLUX.1-dev by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.
中文摘要:ReflectionFlow是一种推理时框架,通过噪声级、提示级和反思级三个扩展维度,使扩散模型能够迭代反思并优化输出,在复杂任务中显著提升图像生成质量。
English Summary: ReflectionFlow is an inference-time framework that enables diffusion models to iteratively reflect on and refine their outputs through three scaling axes, significantly improving image quality on complex tasks.

Authors:Enyu Shi, Jiayi Zhang, Jiancheng An, Marco Di Renzo, Bo Ai, Chau Yuen
Title: Energy-Efficient SIM-assisted Communications: How Many Layers Do We Need?
Abstract:
The stacked intelligent metasurface (SIM), comprising multiple layers of reconfigurable transmissive metasurfaces, is becoming an increasingly viable solution for future wireless communication systems. In this paper, we explore the integration of SIM in a multi-antenna base station for application to downlink multi-user communications, and a realistic power consumption model for SIM-assisted systems is presented. Specifically, we focus on maximizing the energy efficiency (EE) for hybrid precoding design, i.e., the base station digital precoding and SIM wave-based beamforming. Due to the non-convexity and high complexity of the formulated problem, we employ the quadratic transformation method to reformulate the optimization problem and propose an alternating optimization (AO)-based joint precoding framework. Specifically, a successive convex approximation (SCA) algorithm is adopted for the base station precoding design. For the SIM wave-based beamforming, two algorithms are employed: the high-performance semidefinite programming (SDP) method and the low-complexity projected gradient ascent (PGA) algorithm. In particular, the results indicate that while the optimal number of SIM layers for maximizing the EE and spectral efficiency differs, a design of 2 to 5 layers can achieve satisfactory performance for both. Finally, numerical results are illustrated to evaluate the effectiveness of the proposed hybrid precoding framework and to showcase the performance enhancement achieved by the algorithm in comparison to benchmark schemes.
中文: 本文针对多用户下行通信系统,提出了一种结合数字预编码和堆叠智能超表面波束成形的能效混合预编码框架,并基于交替优化算法的结果表明,2至5层超表面结构能在能效和频谱效率两方面取得令人满意的性能。
English: This paper proposes an energy-efficient hybrid precoding framework combining digital precoding and stacked intelligent metasurface (SIM) beamforming for multi-user downlink communications, showing through an alternating optimization approach that a design of 2 to 5 SIM layers achieves satisfactory energy efficiency and spectral efficiency, even though the optimal layer count differs between the two.

Authors:Siyuan Liang, Jiayang Liu, Jiecheng Zhai, Tianmeng Fang, Rongcheng Tu, Aishan Liu, Xiaochun Cao, Dacheng Tao
Title: T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models
Abstract:
The rapid development of generative artificial intelligence has made text-to-video models essential for building future multimodal world simulators. However, these models remain vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. Such vulnerabilities undermine the reliability and security of simulation-based applications. In this paper, we propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses, including semantic ambiguities in prompts, difficulties in detecting malicious content in dynamic video outputs, and inflexible model-centric mitigation strategies. T2VShield introduces a prompt rewriting mechanism based on reasoning and multimodal retrieval to sanitize malicious inputs, along with a multi-scope detection module that captures local and global inconsistencies across time and modalities. The framework does not require access to internal model parameters and works with both open- and closed-source systems. Extensive experiments on five platforms show that T2VShield can reduce jailbreak success rates by up to 35% compared to strong baselines. We further develop a human-centered audiovisual evaluation protocol to assess perceptual safety, emphasizing the importance of visual-level defense in enhancing the trustworthiness of next-generation multimodal simulators.
中文: T2VShield框架通过基于推理和多模态检索的提示重写机制净化恶意输入,并结合跨时空与模态的多范围检测模块,为文本到视频模型提供模型无关的越狱攻击防护,实验中将攻击成功率降低达35%。
English: The T2VShield framework provides model-agnostic protection against jailbreak attacks in text-to-video models by sanitizing inputs through reasoning and multimodal retrieval while detecting inconsistencies across temporal and modal scopes, reducing attack success rates by up to 35% in experiments.

Authors:Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Title: Manipulating Multimodal Agents via Cross-Modal Prompt Injection
Abstract:
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach incorporates two key coordinated components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta prompting and generate a malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.
中文摘要:本文提出CrossInject攻击框架,通过在多模态中嵌入对抗性扰动来操控智能体决策,实验证明该攻击在各类任务中成功率提升超过30.1%,揭示了多模态智能体在安全关键应用中的潜在风险。
English Summary: This paper introduces CrossInject, a novel cross-modal prompt injection attack that exploits security vulnerabilities in multimodal agents by embedding adversarial perturbations across vision and text modalities to hijack decision-making processes and execute unauthorized tasks.

Authors:Yajing Xu, Zhiqiang Liu, Jiaoyan Chen, Mingchen Tu, Zhuo Chen, Jeff Z. Pan, Yichi Zhang, Yushan Zhu, Wen Zhang, Huajun Chen
Title: Multi-modal Knowledge Graph Generation with Semantics-enriched Prompts
Abstract:
Multi-modal Knowledge Graphs (MMKGs) have been widely applied across various domains for knowledge representation. However, the existing MMKGs are significantly fewer than required, and their construction faces numerous challenges, particularly in ensuring the selection of high-quality, contextually relevant images for knowledge graph enrichment. To address these challenges, we present a framework for constructing MMKGs from conventional KGs. Furthermore, to generate higher-quality images that are more relevant to the context in the given knowledge graph, we designed a neighbor selection method called Visualizable Structural Neighbor Selection (VSNS). This method consists of two modules: Visualizable Neighbor Selection (VNS) and Structural Neighbor Selection (SNS). The VNS module filters relations that are difficult to visualize, while the SNS module selects neighbors that most effectively capture the structural characteristics of the entity. To evaluate the quality of the generated images, we performed qualitative and quantitative evaluations on two datasets, MKG-Y and DB15K. The experimental results indicate that using the VSNS method to select neighbors results in higher-quality images that are more relevant to the knowledge graph.
Chinese: 该框架从传统知识图谱构建多模态知识图谱,并引入可视化结构邻居选择(VSNS)方法提升图像质量与上下文相关性,在MKG-Y和DB15K数据集上的实验验证了其有效性。
English: The proposed framework constructs Multi-modal Knowledge Graphs (MMKGs) from conventional KGs and introduces the Visualizable Structural Neighbor Selection (VSNS) method to enhance image quality and contextual relevance, with experimental validation on MKG-Y and DB15K datasets confirming its effectiveness.

Authors:Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei
Title: BitNet b1.58 2B4T Technical Report
Abstract:
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.
中文: BitNet b1.58 2B4T 是首个开源的 2B 参数 1 比特大语言模型,在保持与全精度模型相当性能的同时,显著降低了计算成本、内存占用和能耗。
English: BitNet b1.58 2B4T is the first open-source 1-bit LLM with 2 billion parameters, matching full-precision models in performance while drastically cutting computational costs and memory usage.

Authors:Stefano Maxenti, Ravis Shirkhani, Maxime Elkael, Leonardo Bonati, Salvatore D'Oro, Tommaso Melodia, Michele Polese
Title: AutoRAN: Automated and Zero-Touch Open RAN Systems
Abstract:
[...] This paper presents AutoRAN, an automated, intent-driven framework for zero-touch provisioning of open, programmable cellular networks. Leveraging cloud-native principles, AutoRAN employs virtualization, declarative infrastructure-as-code templates, and disaggregated micro-services to abstract physical resources and protocol stacks. Its orchestration engine integrates Large Language Models (LLMs) to translate high-level intents into machine-readable configurations, enabling closed-loop control via telemetry-driven observability. Implemented on a multi-architecture OpenShift cluster with heterogeneous compute (x86/ARM CPUs, NVIDIA GPUs) and multi-vendor Radio Access Network (RAN) hardware (Foxconn, NI), AutoRAN automates deployment of O-RAN-compliant stacks, including OpenAirInterface, NVIDIA ARC RAN, Open5GS core, and O-RAN Software Community (OSC) RIC components, using CI/CD pipelines. Experimental results demonstrate that AutoRAN is capable of deploying an end-to-end Private 5G network in less than 60 seconds with 1.6 Gbps throughput, validating its ability to streamline configuration, accelerate testing, and reduce manual intervention with performance similar to that of non-cloud-based implementations. With its novel LLM-assisted intent translation mechanism and performance-optimized automation workflow for multi-vendor environments, AutoRAN has the potential to advance the robustness of next-generation cellular supply chains through reproducible, intent-based provisioning across public and private deployments.
中文摘要:AutoRAN是一个利用语言模型将高层意图转化为配置的自动化框架,可在60秒内快速部署高吞吐量私有5G网络,同时减少人工干预。
English Summary: AutoRAN is an automated framework that uses Language Models to translate high-level intents into configurations, enabling rapid deployment of private 5G networks with high throughput in under 60 seconds while reducing manual intervention.

Authors:Jiayao Yang, Jiayi Zhang, Bokai Xu, Jiakang Zheng, Zhilong Liu, Ziheng Liu, Dusit Niyato, Mérouane Debbah, Zhu Han, Bo Ai
Title: White-Box AI Model: Next Frontier of Wireless Communications
Abstract:
The white-box AI (WAI) model, also called the explainable AI (XAI) model, is a novel tool that exposes the reasoning behind the decisions and predictions made by AI algorithms, making them more understandable and transparent. It offers a new approach to address key challenges of interpretability and mathematical validation in traditional black-box models. In this paper, WAI-aided wireless communication systems are proposed and investigated thoroughly to exploit these promising capabilities. First, we introduce the fundamental principles of WAI. Then, a detailed comparison between WAI and traditional black-box models is conducted in terms of optimization objectives and architecture design, with a focus on deep neural networks (DNNs) and transformer networks. Furthermore, in contrast to traditional black-box methods, WAI leverages theory-driven causal modeling and verifiable optimization paths, thereby demonstrating potential advantages in areas such as signal processing and resource allocation. Finally, we outline future research directions for the integration of WAI in wireless communication systems.
中文: 白盒人工智能通过理论驱动的因果建模和可验证优化路径,提升了无线通信系统的透明度和可解释性,在信号处理与资源分配方面展现出优于传统黑盒模型的潜力。
English: White-box AI enhances transparency and interpretability in wireless communication systems by using theory-driven causal modeling and verifiable optimization paths, offering advantages over traditional black-box models in signal processing and resource allocation.

Authors:Yisong Xiao, Aishan Liu, Siyuan Liang, Xianglong Liu, Dacheng Tao
Title: Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models
Abstract:
LLMs have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant fairness concerns. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, where key-value mapping occurs in the MLP layers, we posited that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), a bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded within MLP layer activations by employing prompts centered around biased concepts to detect the emission probabilities for social groups. Subsequently, the adversarial debiasing neutralizer intervenes in MLP activations during inference to equalize the association probabilities among different social groups. Extensive experiments across nine protected attributes show that FairMed significantly outperforms SOTA methods in effectiveness. Compared to the most effective baseline, FairMed presents competitive efficiency by cutting mitigation overhead by hundreds of minutes. FairMed also maintains the LLM's language understanding capabilities without compromising overall performance.
中文: 大语言模型从训练数据中无意习得虚假关联,固化有害社会偏见,而提出的公平中介框架通过探测并中和多层感知机中的刻板关联,有效缓解偏见,在保持模型性能的同时显著优于现有方法。
English: Large language models (LLMs) inadvertently learn spurious correlations from training data, perpetuating harmful social biases, but the proposed Fairness Mediator (FairMed) framework effectively mitigates these biases by probing and neutralizing stereotype associations in MLP layers, outperforming existing methods while preserving model performance.

Authors:Bo Chen, Zhenmei Shi, Zhao Song, Jiahao Zhang
Title: Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent
Abstract:
Recent advancements in Transformer-based architectures have led to impressive breakthroughs in natural language processing tasks, with models such as GPT-4, Claude, and Gemini demonstrating human-level reasoning abilities. However, despite their high performance, concerns remain about the inherent limitations of these models, especially when it comes to learning basic logical functions. While complexity-theoretic analyses indicate that Transformers can represent simple logic functions (e.g., $\mathsf{AND}$, $\mathsf{OR}$, and majority gates) because they belong to the $\mathsf{TC}^0$ class, these results assume ideal parameter settings and do not account for the constraints imposed by gradient descent-based training methods. In this work, we investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. We focus on a simplified variant of the Transformer architecture and consider both $n=\mathrm{poly}(d)$ and $n=\exp(\Omega(d))$ training samples, where each sample is a $d$-size binary string paired with the output of a basic majority function. Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large, growing exponentially with $d$. This work highlights fundamental optimization challenges in training Transformers for the simplest logical reasoning tasks and provides new insights into their theoretical limitations.
中文: 尽管Transformer架构理论上能表达基本逻辑函数,但基于梯度的训练方法无法有效学习简单多数函数,导致泛化误差随输入规模呈指数级增长。
English: Despite Transformers' theoretical capability to represent basic logical functions, gradient-based training fails to effectively teach them simple majority functions, resulting in exponentially high generalization errors with increasing input size.

Authors:Van-Anh Nguyen, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Optimizing Specific and Shared Parameters for Efficient Parameter Tuning
Abstract:
Foundation models, with a vast number of parameters and pretraining on massive datasets, achieve state-of-the-art performance across various applications. However, efficiently adapting them to downstream tasks with minimal computational overhead remains a challenge. Parameter-Efficient Transfer Learning (PETL) addresses this by fine-tuning only a small subset of parameters while preserving pre-trained knowledge. In this paper, we propose SaS, a novel PETL method that effectively mitigates distributional shifts during fine-tuning. SaS integrates (1) a shared module that captures common statistical characteristics across layers using low-rank projections and (2) a layer-specific module that employs hypernetworks to generate tailored parameters for each layer. This dual design ensures an optimal balance between performance and parameter efficiency while introducing less than 0.05% additional parameters, making it significantly more compact than existing methods. Extensive experiments on diverse downstream tasks, few-shot settings and domain generalization demonstrate that SaS significantly enhances performance while maintaining superior parameter efficiency compared to existing methods, highlighting the importance of capturing both shared and layer-specific information in transfer learning. Code and data are available at https://anonymous.4open.science/r/SaS-PETL-3565.
中文: 本文提出了一种新颖的参数高效迁移学习方法SaS,它通过结合共享统计模块和层级特定超网络,有效缓解微调过程中的分布偏移,在极少的参数开销下实现了卓越性能。
English: The paper introduces SaS, a novel Parameter-Efficient Transfer Learning method that combines shared statistical modules and layer-specific hypernetworks to effectively address distributional shifts during fine-tuning, achieving superior performance with minimal parameter overhead.

Authors:Bingchen Qian, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou
Title: Tree-based Models for Vertical Federated Learning: A Survey
Abstract:
Tree-based models have achieved great success in a wide range of real-world applications due to their effectiveness, robustness, and interpretability, which inspired people to apply them in vertical federated learning (VFL) scenarios in recent years. In this paper, we conduct a comprehensive study to give an overall picture of applying tree-based models in VFL, from the perspective of their communication and computation protocols. We categorize tree-based models in VFL into two types, i.e., feature-gathering models and label-scattering models, and provide a detailed discussion regarding their characteristics, advantages, privacy protection mechanisms, and applications. This study also focuses on the implementation of tree-based models in VFL, summarizing several design principles for better satisfying various requirements from both academic research and industrial deployment. We conduct a series of experiments to provide empirical observations on the differences and advances of different types of tree-based models.
Chinese: 本研究全面探讨了基于树的模型在纵向联邦学习中的应用,将其分为特征收集和标签分散两类,并分析了它们的特性、隐私保护机制及实现原则。
English: This study comprehensively examines the application of tree-based models in vertical federated learning, categorizing them into feature-gathering and label-scattering types while analyzing their characteristics, privacy mechanisms, and implementation principles.

Authors:Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
Title: Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
Abstract:
Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decision under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving the performance by 5.1%.
Chinese Summary: SR-RAG是一种新颖框架,通过让大语言模型动态选择检索外部知识或表达自身参数化知识,显著提升了检索增强生成的响应准确性并降低了推理延迟。
English Summary: SR-RAG is a novel framework that enhances retrieval-augmented generation by enabling large language models to dynamically choose between retrieving external knowledge or verbalizing their own parametric knowledge, significantly improving response accuracy and reducing inference latency.
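
The abstract mentions choosing a knowledge source via nearest neighbor search. As a rough illustration of that general idea only, the sketch below takes a majority vote over the labels of the k stored examples closest to a query vector. The labels "retrieve" and "parametric", the Euclidean distance, and the function name are hypothetical; the paper's actual representation and decision rule are not specified here.

```python
import math
from collections import Counter


def knn_source(query, examples, k=3):
    """Pick a knowledge source by majority vote among the k nearest examples.

    examples: list of (vector, source_label) pairs; the labels are hypothetical.
    """
    ranked = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k with two labels avoids tied votes in this toy setting.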

Authors:Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
Title: Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
Abstract:
Selective retrieval improves the accuracy and efficiency of retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals. However, existing approaches underutilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide whether to retrieve external knowledge or verbalize its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM for knowledge source selection, knowledge verbalization, and response generation. SR-RAG further incorporates a nearest neighbor search mechanism at inference time to improve the accuracy of knowledge source decisions under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves their response accuracy and reduces inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces the number of retrievals by 29% while improving performance by 5.1%.
Chinese Summary: SR-RAG是一种新颖框架,通过让大语言模型动态选择检索外部知识或表达自身参数化知识,显著提升了检索增强生成的响应准确性并降低了推理延迟。
English Summary: SR-RAG is a novel framework that enhances retrieval-augmented generation by enabling large language models to dynamically choose between retrieving external knowledge or verbalizing their own parametric knowledge, significantly improving response accuracy and reducing inference latency.

Authors:Simone Maurizio La Cava, Roberto Casula, Sara Concas, Giulia Orrù, Ruben Tolosana, Martin Drahansky, Julian Fierrez, Gian Luca Marcialis
Title: Exploiting Multiple Representations: 3D Face Biometrics Fusion with Application to Surveillance
Abstract:
3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to the limits and characteristics of the different application scenarios. In this study, we investigate how multiple state-of-the-art 3DFR algorithms can be used to generate a better representation of subjects, with the final goal of improving the performance of face recognition systems in challenging uncontrolled scenarios. We also explore how different parametric and non-parametric score-level fusion methods can exploit the unique strengths of multiple 3DFR algorithms to enhance biometric recognition robustness. With this goal, we propose a comprehensive analysis of several face recognition systems across diverse conditions, such as varying distances and camera setups, intra-dataset and cross-dataset, to assess the robustness of the proposed ensemble method. The results demonstrate that the distinct information provided by different 3DFR algorithms can alleviate the problem of generalizing over multiple application scenarios. In addition, the present study highlights the potential of advanced fusion strategies to enhance the reliability of 3DFR-based face recognition systems, providing the research community with key insights to exploit them in real-world applications effectively. Although the experiments are carried out in a specific face verification setup, our proposed fusion-based 3DFR methods may be applied to other tasks around face biometrics that are not strictly related to identity recognition.
中文: 本研究通过分数级融合多种三维人脸重建算法,增强了人脸识别系统在不同场景下的鲁棒性,实验证明该方法能有效提升跨场景泛化能力和实际应用的可靠性。
English: This study demonstrates that combining multiple 3D face reconstruction algorithms through score-level fusion enhances face recognition robustness across diverse scenarios, with experiments confirming improved generalization and reliability for real-world applications.
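
Score-level fusion can be illustrated with a small sketch: each recognition system produces a match score per comparison, the scores are brought to a common range, and a weighted sum combines them. The min-max normalization and weighted-sum rule below are one simple, commonly used choice and are assumed here for illustration; the study evaluates several parametric and non-parametric fusion methods that this sketch does not reproduce.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(score_lists, weights=None):
    """Fuse aligned score lists (one list per system) by a weighted sum.

    Equal weights are used when none are given (an assumption of this sketch).
    """
    k = len(score_lists)
    if weights is None:
        weights = [1.0 / k] * k
    normalized = [min_max_normalize(s) for s in score_lists]
    n = len(normalized[0])
    return [
        sum(w * col[i] for w, col in zip(weights, normalized))
        for i in range(n)
    ]
```

Normalizing before summing matters because different systems may report scores on different scales.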

Authors:Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou
Title: PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
Abstract:
In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy under the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
中文:PolyMath是一个涵盖18种语言的多语言数学推理基准,通过评估先进大语言模型发现其存在显著的跨语言性能差异和推理一致性不足等关键挑战。
English: PolyMath is a comprehensive multilingual mathematical reasoning benchmark that evaluates advanced LLMs across 18 languages and multiple difficulty levels, revealing significant performance variations and key challenges in multilingual reasoning capabilities.

Authors:Yi Zeng, Feifei Zhao, Yuwei Wang, Enmeng Lu, Yaodong Yang, Lei Wang, Chao Liu, Yitao Liang, Dongcheng Zhao, Bing Han, Haibo Tong, Yao Liang, Dongqi Liang, Kang Sun, Boyuan Chen, Jinyu Fan
Title: Super Co-alignment of Human and AI for Sustainable Symbiotic Society
Abstract:
As Artificial Intelligence (AI) advances toward Artificial General Intelligence (AGI) and eventually Artificial Superintelligence (ASI), it may surpass human control, deviate from human values, and even lead to irreversible catastrophic consequences in extreme cases. This looming risk underscores the critical importance of the "superalignment" problem - ensuring that AI systems that are much smarter than humans remain aligned with human (compatible) intentions and values. While current scalable oversight and weak-to-strong generalization methods demonstrate certain applicability, they exhibit fundamental flaws in addressing the superalignment paradigm - notably, the unidirectional imposition of human values cannot accommodate superintelligence's autonomy or ensure AGI/ASI's stable learning. We contend that the values for a sustainable symbiotic society should be co-shaped by humans and living AI together, achieving "Super Co-alignment." Guided by this vision, we propose a concrete framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision-making, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity's evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the Self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguish good from evil, and proactively prioritize human well-being. The integration of externally-driven oversight with intrinsically-driven proactive alignment will co-shape symbiotic values and rules through iterative human-ASI co-alignment, paving the way for achieving safe and beneficial AGI and ASI for good, for humans, and for a symbiotic ecology.
中文摘要:本文提出"超级协同对齐"框架,通过结合外部人类监督与AI内在主动对齐机制,应对超智能AI系统风险,主张人机协同塑造价值观以确保AGI/ASI的安全发展。
English Summary: The paper proposes a "Super Co-alignment" framework combining external human oversight with AI's intrinsic proactive alignment to address risks from superintelligent AI systems, advocating for human-AI collaborative value formation to ensure safe AGI/ASI development.

Authors:Kai Zhao, Zhaohui Yang, Ye Hu, Mingzhe Chen, Chen Zhu, Zhaoyang Zhang
Title: Efficient Split Federated Learning for Large Language Models over Communication Networks
Abstract:
Fine-tuning pre-trained large language models (LLMs) in a distributed manner poses significant challenges on resource-constrained edge networks. To address this challenge, we propose SflLLM, a novel framework that integrates split federated learning with parameter-efficient fine-tuning techniques. By leveraging model splitting and low-rank adaptation (LoRA), SflLLM reduces the computational burden on edge devices. Furthermore, the introduction of a federated server facilitates parallel training and enhances data privacy. To accommodate heterogeneous communication conditions and diverse computational capabilities of edge devices, as well as the impact of LoRA rank selection on model convergence and training cost, we formulate a joint optimization problem over both communication and computation resources. The formulated problem jointly optimizes subchannel allocation, power control, model splitting point selection, and LoRA rank configuration, with the aim of minimizing the total training delay. An iterative optimization algorithm is proposed to solve this problem efficiently. Specifically, a greedy heuristic is employed for subchannel allocation, the power control subproblem is reformulated as a convex optimization problem using auxiliary variables, and an exhaustive search is adopted for optimal split position and rank selection. Simulation results demonstrate that the proposed SflLLM framework achieves comparable model accuracy while significantly reducing client-side computational requirements. Furthermore, the proposed resource allocation scheme and adaptive LoRA rank selection strategy notably reduce the training latency compared to conventional approaches.
中文:SflLLM框架通过融合分割联邦学习与参数高效微调技术,在降低边缘设备计算负担的同时保持模型精度,并通过优化资源配置显著减少训练延迟。
English: The SflLLM framework combines split federated learning with parameter-efficient fine-tuning to reduce computational load on edge devices while maintaining model accuracy and minimizing training delay through optimized resource allocation.
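
Two ideas from the abstract lend themselves to a short sketch: the LoRA rank determines how many parameters are trained, and the split position and rank are chosen by exhaustive search. The code below assumes the standard LoRA factorization of a d_out x d_in weight update into two matrices of rank r, which gives r * (d_in + d_out) trainable parameters. The `delay_fn` argument is a hypothetical stand-in for the paper's training-delay model, which is not given in the abstract.

```python
from itertools import product


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def exhaustive_search(split_points, ranks, delay_fn):
    """Try every (split, rank) pair and return (best_delay, split, rank).

    delay_fn(split, rank) is a caller-supplied, hypothetical delay model.
    """
    best = None
    for split, rank in product(split_points, ranks):
        delay = delay_fn(split, rank)
        if best is None or delay < best[0]:
            best = (delay, split, rank)
    return best
```

Exhaustive search is practical here because both candidate sets are small: the number of layers at which a model can be split and the handful of ranks typically considered.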

Authors:Fenghao Zhu, Xinquan Wang, Siming Jiang, Xinyi Li, Maojun Zhang, Yixuan Chen, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Zhaoyang Zhang, Richeng Jin, Yongming Huang, Wei Feng, Tingting Yang, Baoming Bai, Feifei Gao, Kun Yang, Yuanwei Liu, Sami Muhaidat, Chau Yuen, Kaibin Huang, Kai-Kit Wong, Dusit Niyato, Ying-Chang Liang, Mérouane Debbah
Title: Wireless Large AI Model: Shaping the AI-Native Future of 6G and Beyond
Abstract:
The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and decision-making. In light of these remarkable capabilities, this paper provides a comprehensive survey of WLAM, elucidating its fundamental principles, diverse applications, critical challenges, and future research opportunities. We begin by introducing the background of WLAM and analyzing the key synergies with wireless networks, emphasizing the mutual benefits. Subsequently, we explore the foundational characteristics of WLAM, delving into their unique relevance in wireless environments. Then, the role of WLAM in optimizing wireless communication systems across various use cases and the reciprocal benefits are systematically investigated. Furthermore, we discuss the integration of WLAM with emerging technologies, highlighting their potential to enable transformative capabilities and breakthroughs in wireless communication. Finally, we thoroughly examine the high-level challenges hindering the practical implementation of WLAM and discuss pivotal future research directions.
中文: 本文全面综述了无线大人工智能模型作为变革性技术,详细探讨其基本原理、多样化应用、关键挑战及未来研究方向,并重点分析其与无线网络的协同增效关系。
English: This paper comprehensively surveys wireless large AI models (WLAM) as a transformative technology for next-generation communication systems, covering their principles, applications, challenges, and future directions while highlighting their synergistic relationship with wireless networks.

Authors:Zhisheng Huang, Peng Wang, Jingdong Zhang, Yuan Liu, Xin Li, Wenping Wang
Title: 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS
Abstract:
3D Gaussian Splatting (3DGS) has revolutionized neural rendering with its efficiency and quality, but like many novel view synthesis methods, it heavily depends on accurate camera poses from Structure-from-Motion (SfM) systems. Although recent SfM pipelines have made impressive progress, questions remain about how to improve both their robustness in challenging conditions (e.g., textureless scenes) and the precision of camera parameter estimation at the same time. We present 3R-GS, a 3D Gaussian Splatting framework that bridges this gap by jointly optimizing 3D Gaussians and camera parameters, starting from the large reconstruction priors of MASt3R-SfM. We note that naively performing joint 3D Gaussian and camera optimization faces two challenges: sensitivity to the quality of SfM initialization and limited capacity for global optimization, both of which lead to suboptimal reconstruction results. 3R-GS overcomes these issues by incorporating optimized practices, enabling robust scene reconstruction even with imperfect camera registration. Extensive experiments demonstrate that 3R-GS delivers high-quality novel view synthesis and precise camera pose estimation while remaining computationally efficient. Project page: https://zsh523.github.io/3R-GS/
Chinese: 3R-GS是一种新型的3D高斯泼溅框架,通过联合优化3D高斯分布和相机参数,即使在相机初始化不完善的情况下也能实现鲁棒的场景重建和高质量的新视角合成。
English: 3R-GS is a novel 3D Gaussian Splatting framework that jointly optimizes 3D Gaussians and camera parameters, enabling robust scene reconstruction and high-quality novel view synthesis even with imperfect camera initialization.

Authors:Yougang Lyu, Shijie Ren, Yue Feng, Zihan Wang, Zhumin Chen, Zhaochun Ren, Maarten de Rijke
Title: Cognitive Debiasing Large Language Models for Decision-Making
Abstract:
Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts only contain one type of cognitive bias, limiting their effectiveness in more challenging scenarios involving multiple cognitive biases. To fill this gap, we propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD), that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts. Experimental results on finance, healthcare, and legal decision-making tasks, using both closed-source and open-source LLMs, demonstrate that the proposed SACD method outperforms both advanced prompt engineering methods and existing cognitive debiasing techniques in average accuracy under single-bias and multi-bias settings.
中文: 大语言模型在决策支持中潜力显著,但存在认知偏差问题;本文提出的自适应认知去偏差方法通过迭代优化提示,在单偏差和多偏差场景下均优于现有技术,提高了多个领域的决策准确性。
English: Large language models (LLMs) show promise in decision-making support but face challenges from cognitive biases, which the proposed self-adaptive cognitive debiasing (SACD) method effectively mitigates by iteratively refining prompts, outperforming existing techniques in accuracy across various domains.

Authors:Xinquan Wang, Fenghao Zhu, Chongwen Huang, Zhaohui Yang, Zhaoyang Zhang, Sami Muhaidat, Chau Yuen, Mérouane Debbah
Title: TeleMoM: Consensus-Driven Telecom Intelligence via Mixture of Models
Abstract:
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches like retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coordination. To address this issue, we propose Telecom mixture of models (TeleMoM), a consensus-driven ensemble framework that integrates multiple LLMs for enhanced decision-making in Telecom. TeleMoM employs a two-stage process: proponent models generate justified responses, and an adjudicator finalizes decisions, supported by a quality-checking mechanism. This approach leverages strengths of diverse models to improve accuracy, reduce biases, and handle domain-specific complexities effectively. Evaluation results demonstrate that TeleMoM achieves a 9.7% increase in answer accuracy, highlighting its effectiveness in Telecom applications.
中文摘要:提出的TeleMoM框架通过共识驱动的两阶段流程整合多个大语言模型,显著提升了电信领域的决策准确性,实现了9.7%的精度提升。
English Summary: The proposed TeleMoM framework enhances decision-making in telecommunications by integrating multiple large language models through a consensus-driven, two-stage process, achieving a 9.7% accuracy improvement.

Authors:Siwei Wang, Zhiwei Chen, Liujuan Cao, Rongrong Ji
Title: Purifying, Labeling, and Utilizing: A High-Quality Pipeline for Small Object Detection
Abstract:
Small object detection is a broadly investigated research task and is commonly conceptualized as a "pipeline-style" engineering process. In the upstream, images serve as raw materials for processing in the detection pipeline, where pre-trained models are employed to generate initial feature maps. In the midstream, an assigner selects positive and negative training samples. Subsequently, these samples and features are fed into the downstream for classification and regression. Previous small object detection methods often focused on improving isolated stages of the pipeline, thereby neglecting holistic optimization and consequently constraining overall performance gains. To address this issue, we have optimized three key aspects, namely Purifying, Labeling, and Utilizing, in this pipeline, proposing a high-quality Small object detection framework termed PLUSNet. Specifically, PLUSNet comprises three sequential components: the Hierarchical Feature Purifier (HFP) for purifying upstream features, the Multiple Criteria Label Assignment (MCLA) for improving the quality of midstream training samples, and the Frequency Decoupled Head (FDHead) for more effectively exploiting information to accomplish downstream tasks. The proposed PLUS modules are readily integrable into various object detectors, thus enhancing their detection capabilities in multi-scale scenarios. Extensive experiments demonstrate that the proposed PLUSNet achieves significant and consistent improvements across multiple datasets for small object detection.
中文摘要:该摘要提出PLUSNet框架,通过净化上游特征、优化中游训练样本和提升下游信息利用,全面改进小目标检测流程,在多个数据集上实现了显著且稳定的性能提升。
English Summary: The paper introduces PLUSNet, a holistic framework that optimizes small object detection by purifying features, refining training samples, and enhancing information utilization across the pipeline, achieving consistent performance gains on multiple datasets.

Authors:Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, Rongrong Ji
Title: More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV
Abstract:
Applications of unmanned aerial vehicles (UAVs) in logistics, agricultural automation, urban management, and emergency response are highly dependent on oriented object detection (OOD) to enhance visual perception. Although existing datasets for OOD in UAV provide valuable resources, they are often designed for specific downstream tasks. Consequently, they exhibit limited generalization performance in real flight scenarios and fail to thoroughly demonstrate algorithm effectiveness in practical environments. To bridge this critical gap, we introduce CODrone, a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements, ensuring greater applicability and robustness in UAV-based OOD. Based on application requirements, we identify four key limitations in current UAV OOD datasets (low image resolution, limited object categories, single-view imaging, and restricted flight altitudes) and propose corresponding improvements to enhance their applicability and robustness. Furthermore, CODrone contains a broad spectrum of annotated images collected from multiple cities under various lighting conditions, enhancing the realism of the benchmark. To rigorously evaluate CODrone as a new benchmark and gain deeper insights into the novel challenges it presents, we conduct a series of experiments based on 22 classical or SOTA methods. Our evaluation not only assesses the effectiveness of CODrone in real-world scenarios but also highlights key bottlenecks and opportunities to advance OOD in UAV applications. Overall, CODrone fills the data gap in OOD from the UAV perspective and provides a benchmark with enhanced generalization capability, better aligning with practical applications and future algorithm development.
中文: CODrone数据集通过提供反映真实场景的综合基准,解决了无人机定向物体检测现有资源的局限性,从而提升了在各种情境下的泛化能力和实用性。
English: The CODrone dataset addresses the limitations of existing oriented object detection resources for UAVs by providing a comprehensive benchmark that reflects real-world conditions, enhancing generalization and applicability across various scenarios.

Authors:Xin Wang, Haoyang Li, Haibo Chen, Zeyang Zhang, Wenwu Zhu
Title: Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models
Abstract:
Large language models (LLMs) have substantially advanced machine learning research, including natural language processing, computer vision, data mining, etc., yet they still exhibit critical limitations in explainability, reliability, adaptability, and extensibility. In this paper, we overview a promising learning paradigm, i.e., Modular Machine Learning (MML), as an essential approach toward new-generation LLMs capable of addressing these issues. We begin by systematically and comprehensively surveying the existing literature on modular machine learning, with a particular focus on modular data representation and modular models. Then, we propose a unified MML framework for LLMs, which decomposes the complex structure of LLMs into three interdependent components: modular representation, modular model, and modular reasoning. Specifically, the MML paradigm discussed in this article is able to: i) clarify the internal working mechanism of LLMs through the disentanglement of semantic components; ii) allow for flexible and task-adaptive model design; iii) enable an interpretable and logic-driven decision-making process. We further elaborate a feasible implementation of MML-based LLMs by leveraging advanced techniques such as disentangled representation learning, neural architecture search, and neuro-symbolic learning. Last but not least, we critically identify the remaining key challenges, such as the integration of continuous neural and discrete symbolic processes, joint optimization, and computational scalability, and present promising future research directions that deserve further exploration. Ultimately, we believe the integration of MML with LLMs has the potential to bridge the gap between statistical (deep) learning and formal (logical) reasoning, thereby paving the way for robust, adaptable, and trustworthy AI systems across a wide range of real-world applications.
中文摘要:模块化机器学习(MML)作为一种新范式,通过模块化表示、模型和推理来提升大语言模型的可解释性与适应性,但神经与符号系统的融合等关键挑战仍需解决。
English Summary: Modular Machine Learning (MML) is proposed as a paradigm to enhance large language models by improving their explainability, adaptability, and reasoning through modular components, though challenges like neural-symbolic integration remain.

Authors:Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Title: Taming the Titans: A Survey of Efficient LLM Inference Serving
Abstract:
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.
中文摘要:本文系统综述了针对大语言模型推理中内存与计算瓶颈的优化方法,涵盖实例级、集群级及新兴场景策略,并指出了未来研究方向。
English Summary: This paper surveys recent methods to optimize Large Language Model inference by addressing memory and computational challenges, covering instance-level, cluster-level, and emerging strategies while identifying future research directions.

Authors:Xize Cheng, Slytherin Wang, Zehan Wang, Rongjie Huang, Tao Jin, Zhou Zhao
Title: Unleashing the Power of Natural Audio Featuring Multiple Sound Sources
Abstract:
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at https://clearsep.github.io.
中文摘要:ClearSep是一种创新框架,通过数据引擎将自然混合音频分解为独立音轨,结合迭代优化和针对性训练策略,实现了真实场景下的高效声音分离。
English Summary: ClearSep is a novel framework that uses a data engine to decompose naturally mixed audio into independent tracks, enabling effective sound separation in real-world scenarios through iterative optimization and tailored training strategies.

Authors:Luyuan Zhang, Xidong Mu, An Liu, Yuanwei Liu
Title: Two-Timescale Joint Transmit and Pinching Beamforming for Pinching-Antenna Systems
Abstract:
Pinching antenna systems (PASS) have been proposed as a revolutionary flexible antenna technology which facilitates line-of-sight links via numerous low-cost pinching antennas with adjustable activation positions over waveguides. This letter proposes a two-timescale joint transmit and pinching beamforming design for the maximization of sum rate of a PASS-based downlink multi-user multiple input single output system. A primal dual decomposition method is developed to decouple the two-timescale problem into two sub-problems: 1) A Karush-Kuhn-Tucker-guided dual learning-based approach is proposed to solve the short-term transmit beamforming design sub-problem; 2) The long-term pinching beamforming design sub-problem is tackled by adopting a stochastic successive convex approximation method. Simulation results demonstrate that the proposed two-timescale algorithm achieves a significant performance gain compared to other baselines.
中文: 本文提出了一种基于夹持天线系统的双时间尺度联合发射与夹持波束成形设计,通过原始对偶分解和优化方法最大化多用户系统的总速率,仿真结果表明其性能显著优于现有基准方案。
English: This letter introduces a two-timescale joint transmit and pinching beamforming design for PASS-based multi-user systems, employing primal dual decomposition and optimization methods to maximize sum rate, with simulations showing superior performance over baselines.

Authors:Wei Zou, Sen Yang, Yu Bao, Shujian Huang, Jiajun Chen, Shanbo Cheng
Title: Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data
Abstract:
The rise of Large Language Models (LLMs) has reshaped machine translation (MT), but multilingual MT still relies heavily on parallel data for supervised fine-tuning (SFT), facing challenges like data scarcity for low-resource languages and catastrophic forgetting. To address these issues, we propose TRANS-ZERO, a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of LLMs. TRANS-ZERO combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization, achieving strong translation performance that rivals supervised methods. Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions. Further analysis reveals that G-MCTS itself significantly enhances translation quality by exploring semantically consistent candidates through iterative translations, providing a robust foundation for the framework's success.
中文: TRANS-ZERO是一个仅使用单语数据和大型语言模型内在多语言知识的自博弈框架,通过结合遗传蒙特卡洛树搜索与偏好优化,实现了与监督方法相媲美的翻译性能,尤其在非英语翻译方向上表现优异。
English: TRANS-ZERO is a self-play framework that uses only monolingual data and LLMs' multilingual capabilities, combining G-MCTS with preference optimization to achieve translation performance comparable to supervised methods, especially excelling in non-English directions.

Authors:Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang
Title: Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Abstract:
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
Chinese: 本文提出了一种基于扩散时间步的递归视觉标记方法,构建了统一的多模态理解与生成框架,在两项任务上均实现了优于现有模型的性能表现。
English: This paper introduces a novel approach to Multimodal Large Language Models by developing recursive visual tokens based on diffusion timesteps, enabling unified visual comprehension and generation that outperforms existing methods.

Authors:Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu
Title: Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM
Abstract:
Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods cannot decompose the overall information into separate scenes and fail to change scenes smoothly based on the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular-network-based module that controls the camera movements well. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments demonstrate Modular-Cam's strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements. Generated results are available at https://modular-cam.github.io.
中文: 本文提出Modular-Cam新方法,通过大语言模型解析复杂文本指令为多场景序列,并采用专用模块实现流畅镜头切换与跨场景一致性,显著提升了动态多场景视频的生成质量。
English: This paper introduces Modular-Cam, a novel text-to-video generation method that leverages large language models to decompose complex prompts into multiple scenes and employs specialized modules for smooth camera transitions and cross-scene consistency.

Authors:Yu Lin, Jianghang Lin, Kai Ye, You Shen, Yan Zhang, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Title: S$^2$Teacher: Step-by-step Teacher for Sparsely Annotated Oriented Object Detection
Abstract:
Although fully-supervised oriented object detection has made significant progress in multimodal remote sensing image understanding, it comes at the cost of labor-intensive annotation. Recent studies have explored weakly and semi-supervised learning to alleviate this burden. However, these methods overlook the difficulties posed by dense annotations in complex remote sensing scenes. In this paper, we introduce a novel setting called sparsely annotated oriented object detection (SAOOD), which only labels partial instances, and propose a solution to address its challenges. Specifically, we focus on two key issues in the setting: (1) sparse labeling leading to overfitting on limited foreground representations, and (2) unlabeled objects (false negatives) confusing feature learning. To this end, we propose the S$^2$Teacher, a novel method that progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations. Additionally, it reweights the loss of unlabeled objects to mitigate their impact during training. Extensive experiments demonstrate that S$^2$Teacher not only significantly improves detector performance across different sparse annotation levels but also achieves near-fully-supervised performance on the DOTA dataset with only 10% annotation instances, effectively balancing detection accuracy with annotation efficiency. The code will be public.
中文: 本文提出稀疏标注定向目标检测(SAOOD)以减少标注负担,并设计了S$^2$Teacher方法,通过逐步生成伪标签和调整损失权重来应对过拟合和漏标问题,在极少标注下实现接近全监督的性能。
English: The paper introduces Sparsely Annotated Oriented Object Detection (SAOOD) to reduce annotation costs and proposes S$^2$Teacher, a method that progressively generates pseudo-labels and adjusts loss weights to overcome overfitting and false negatives, achieving near-fully-supervised performance with minimal annotations.

Authors:Zheng Zhang, Zhaolin Wang, Xidong Mu, Bingtao He, Jian Chen, Yuanwei Liu
Title: Integrated Sensing and Communications for Pinching-Antenna Systems (PASS)
Abstract:
An integrated sensing and communication (ISAC) design for pinching antenna systems (PASS) is proposed, where the pinching antennas are deployed to establish reliable line-of-sight communication and sensing links. More particularly, a separated ISAC design is proposed for the two-waveguide PASS, where one waveguide is used to emit the information-bearing signals for ISAC transmission while the other waveguide is used to receive the reflected echo signals. Based on this framework, a penalty-based alternating optimization algorithm is proposed to maximize the illumination power as well as ensure the communication quality-of-service requirement. Numerical results demonstrate that the proposed PASS-ISAC scheme outperforms the conventional antenna scheme.
中文: 本文提出了一种采用夹持天线系统的集成感知与通信设计,通过基于惩罚的交替优化算法在保证通信质量的同时提升照明功率,结果显示其性能优于传统天线方案。
English: This paper introduces an integrated sensing and communication design using pinching antenna systems, employing a penalty-based alternating optimization algorithm to enhance illumination power while maintaining communication quality, with results showing superior performance over conventional antenna approaches.

Authors:Chendi Ge, Xin Wang, Ziwei Zhang, Yijian Qin, Hong Chen, Haiyang Wu, Yang Zhang, Yuekui Yang, Wenwu Zhu
Title: Behavior Importance-Aware Graph Neural Architecture Search for Cross-Domain Recommendation
Abstract:
Cross-domain recommendation (CDR) mitigates data sparsity and cold-start issues in recommendation systems. While recent CDR approaches using graph neural networks (GNNs) capture complex user-item interactions, they rely on manually designed architectures that are often suboptimal and labor-intensive. Additionally, extracting valuable behavioral information from source domains to improve target domain recommendations remains challenging. To address these challenges, we propose Behavior importance-aware Graph Neural Architecture Search (BiGNAS), a framework that jointly optimizes GNN architecture and data importance for CDR. BiGNAS introduces two key components: a Cross-Domain Customized Supernetwork and a Graph-Based Behavior Importance Perceptron. The supernetwork, as a one-shot, retrain-free module, automatically searches the optimal GNN architecture for each domain without the need for retraining. The perceptron uses auxiliary learning to dynamically assess the importance of source domain behaviors, thereby improving target domain recommendations. Extensive experiments on benchmark CDR datasets and a large-scale industry advertising dataset demonstrate that BiGNAS consistently outperforms state-of-the-art baselines. To the best of our knowledge, this is the first work to jointly optimize GNN architecture and behavior data importance for cross-domain recommendation.
中文摘要:提出的BiGNAS框架通过自动优化图神经网络架构并评估源域行为重要性,有效提升了跨领域推荐性能,在实验中显著优于现有先进方法。
English Summary: The proposed BiGNAS framework automatically optimizes graph neural network architectures and assesses source domain behavior importance to enhance cross-domain recommendation performance, demonstrating superior results over existing methods.

Authors:Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li
Title: Retrieval Augmented Generation with Collaborative Filtering for Personalized Text Generation
Abstract:
Recently, the personalization of Large Language Models (LLMs) to generate content that aligns with individual user preferences has garnered widespread attention. Personalized Retrieval-Augmented Generation (RAG), which retrieves relevant documents from the user's history to reflect their preferences and enhance LLM generation, is one commonly used approach for personalization. However, existing personalized RAG methods do not consider that the histories of similar users can also assist in personalized generation for the current user, meaning that collaborative information between users can also benefit personalized generation. Inspired by the application of collaborative filtering in recommender systems, we propose a method called CFRAG, which adapts Collaborative Filtering to RAG for personalized text generation. However, this presents two challenges: (1) how to incorporate collaborative information without explicit user similarity labels? (2) how to retrieve documents that support personalized LLM generation? For Challenge 1, we use contrastive learning to train user embeddings to retrieve similar users and introduce collaborative information. For Challenge 2, we design a personalized retriever and reranker to retrieve the top-$k$ documents from these users' histories. We take into account the user's preference during retrieval and reranking. Then we leverage feedback from the LLM to fine-tune the personalized retriever and reranker, enabling them to retrieve documents that meet the personalized generation needs of the LLM. Experimental results on the Language Model Personalization (LaMP) benchmark validate the effectiveness of CFRAG. Further analysis confirms the importance of incorporating collaborative information.
中文: CFRAG方法将协同过滤引入个性化RAG,通过对比学习识别相似用户并设计结合大语言模型反馈的检索器,有效利用用户间的协同信息提升个性化文本生成效果。
English: CFRAG introduces collaborative filtering into personalized RAG by using contrastive learning to identify similar users and designing specialized retrievers that incorporate LLM feedback, effectively enhancing personalized text generation through shared user history insights.

Authors:Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Enyun Yu
Title: Unified Generative Search and Recommendation
Abstract:
Modern commercial platforms typically offer both search and recommendation functionalities to serve diverse user needs, making joint modeling of these tasks an appealing direction. While prior work has shown that integrating search and recommendation can be mutually beneficial, it also reveals a performance trade-off: enhancements in one task often come at the expense of the other. This challenge arises from their distinct information requirements: search emphasizes semantic relevance between queries and items, whereas recommendation depends more on collaborative signals among users and items. Effectively addressing this trade-off requires tackling two key problems: (1) integrating both semantic and collaborative signals into item representations, and (2) guiding the model to distinguish and adapt to the unique demands of search and recommendation. The emergence of generative retrieval with Large Language Models (LLMs) presents new possibilities. This paradigm encodes items as identifiers and frames both search and recommendation as sequential generation tasks, offering the flexibility to leverage multiple identifiers and task-specific prompts. In light of this, we introduce GenSAR, a unified generative framework for balanced search and recommendation. Our approach designs dual-purpose identifiers and tailored training strategies to incorporate complementary signals and align with task-specific objectives. Experiments on both public and commercial datasets demonstrate that GenSAR effectively reduces the trade-off and achieves state-of-the-art performance on both tasks.
中文: GenSAR是一个统一的生成式框架,通过设计双用途标识符和定制化训练策略,融合语义与协同信号来平衡搜索与推荐任务,在减少性能权衡的同时实现了两项任务的顶尖表现。
English: GenSAR is a unified generative framework that uses dual-purpose identifiers and tailored training strategies to effectively balance search and recommendation tasks by incorporating both semantic and collaborative signals, achieving state-of-the-art performance while reducing the trade-off between them.

Authors:Zhengwei Tao, Zhi Jin, Bincheng Li, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao
Title: PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation
Abstract:
Predicting future events stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate the forecasting capabilities by formalizing the event prediction as a retrieval-augmented generation (RAG) and reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles. However, because there is no consideration of whether the questions can be supported by valid or sufficient supporting rationales, some of the questions in these benchmarks may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for event prediction. Through extensive experiments, we first demonstrate the validity of CIL and conduct in-depth investigations into event prediction with the aid of CIL. Subsequently, we evaluate several representative prediction systems on PROPHET, drawing valuable insights for future directions.
中文: PROPHET基准通过引入因果干预似然度(CIL)来确保预测问题具备可推断性,解决了现有基准中问题缺乏有效支持依据的缺陷,为事件预测系统提供了更可靠的评估框架。
English: The PROPHET benchmark addresses the issue of noninferable questions in event prediction by introducing Causal Intervened Likelihood (CIL) to ensure questions are supported by valid rationales, enabling more reliable evaluation of forecasting systems.

Authors:Changshuo Zhang, Xiao Zhang, Teng Shi, Jun Xu, Ji-Rong Wen
Title: Test-Time Alignment for Tracking User Interest Shifts in Sequential Recommendation
Abstract:
Sequential recommendation is essential in modern recommender systems, aiming to predict the next item a user may interact with based on their historical behaviors. However, real-world scenarios are often dynamic and subject to shifts in user interests. Conventional sequential recommendation models are typically trained on static historical data, limiting their ability to adapt to such shifts and resulting in significant performance degradation during testing. Recently, Test-Time Training (TTT) has emerged as a promising paradigm, enabling pre-trained models to dynamically adapt to test data by leveraging unlabeled examples during testing. However, applying TTT to effectively track and address user interest shifts in recommender systems remains an open and challenging problem. Key challenges include how to capture temporal information effectively and how to explicitly identify shifts in user interests during the testing phase. To address these issues, we propose T$^2$ARec, a novel model leveraging a state space model for TTT by introducing two Test-Time Alignment modules tailored for sequential recommendation, effectively capturing the distribution shifts in user interest patterns over time. Specifically, T$^2$ARec aligns absolute time intervals with model-adaptive learning intervals to capture temporal dynamics and introduces an interest state alignment mechanism to effectively and explicitly identify the user interest shifts with theoretical guarantees. These two alignment modules enable efficient and incremental updates to model parameters in a self-supervised manner during testing, enhancing predictions for online recommendation. Extensive evaluations on three benchmark datasets demonstrate that T$^2$ARec achieves state-of-the-art performance and robustly mitigates the challenges posed by user interest shifts.
中文摘要:T$^2$ARec模型通过两个测试时对齐模块,利用状态空间模型动态捕捉用户兴趣漂移,在测试阶段以自监督方式实现参数增量更新,显著提升了序列推荐的性能。
English Summary: T$^2$ARec introduces test-time alignment modules using state space models to dynamically capture user interest shifts in sequential recommendations, achieving superior performance through self-supervised updates during testing.

Authors:You Wang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao
Title: Balancing Multi-Target Semi-Supervised Medical Image Segmentation with Collaborative Generalist and Specialists
Abstract:
Despite the promising performance achieved by current semi-supervised models in segmenting individual medical targets, many of these models suffer a notable decrease in performance when tasked with the simultaneous segmentation of multiple targets. A key factor is the imbalanced scales among different targets: when multiple targets are segmented simultaneously, large targets dominate the loss, leading to small targets being misclassified as larger ones. To this end, we propose a novel method, which consists of a Collaborative Generalist and several Specialists, termed CGS. It is centered around the idea of employing a specialist for each target class, thus avoiding the dominance of larger targets. The generalist performs conventional multi-target segmentation, while each specialist is dedicated to distinguishing a specific target class from the remaining target classes and the background. Based on a theoretical insight, we demonstrate that CGS can achieve more balanced training. Moreover, we develop cross-consistency losses to foster collaborative learning between the generalist and the specialists. Lastly, regarding their intrinsic relation that the target class of any specialized head should belong to the remaining classes of the other heads, we introduce an inter-head error detection module to further enhance the quality of pseudo-labels. Experimental results on three popular benchmarks showcase its superior performance compared to state-of-the-art methods.
中文: 现有半监督医学图像分割模型在多目标任务中因尺度失衡而性能下降,本文提出的CGS方法通过通用-专家协作架构与交叉一致性损失实现均衡训练,在三个基准测试中展现出优越性能。
English: Current semi-supervised medical image segmentation models struggle with multi-target tasks due to scale imbalance, but the proposed CGS method employs collaborative generalist-specialist architecture with cross-consistency losses to achieve balanced training and superior performance.

Authors:Sanwoo Lee, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Yunfang Wu
Title: Dynamic Fisher-weighted Model Merging via Bayesian Optimization
Abstract:
The fine-tuning of pre-trained language models has resulted in the widespread availability of task-specific models. Model merging offers an efficient way to create multi-task models by combining these fine-tuned models at the parameter level, without the need for training data or joint training on multiple datasets. Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise. Both approaches exhibit their own weaknesses, leading to a notable performance gap compared to multi-task fine-tuning. In this paper, we unify these seemingly distinct strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge). Specifically, candidate models are associated with a set of coefficients that linearly scale their fine-tuned parameters. Bayesian optimization is applied to dynamically adjust these coefficients, aiming to maximize overall performance on validation sets. Each iteration of this process integrates parameter importance based on the Fisher information conditioned by the coefficients. Experimental results show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks. Our analysis shows that the effectiveness of DF-Merge arises from the unified view of merging and that near-optimal performance is achievable in a few iterations, even with minimal validation data.
Chinese: DF-Merge将模型级和参数级合并策略统一到一个通用框架中,利用贝叶斯优化基于Fisher信息动态调整系数,在少量验证数据下即可在不同任务和模型规模上实现卓越性能。
English: DF-Merge unifies model-wise and parameter-wise merging strategies into a general framework, using Bayesian optimization to dynamically adjust coefficients based on Fisher information, achieving superior performance across diverse tasks and model sizes with minimal validation data.
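
The abstract above combines per-model scaling coefficients with Fisher-based parameter importance. As a rough illustration only, the sketch below shows plain Fisher-weighted averaging with one scalar coefficient per model; it is not the DF-Merge algorithm itself. The function name, the flat-list parameter layout, and the fixed coefficients are all assumptions made for this example, and the Bayesian optimization loop that the abstract describes for tuning the coefficients is omitted.

```python
def fisher_weighted_merge(params, fishers, coeffs):
    """Merge parameter vectors element-wise with Fisher weighting.

    params[i][j]  : parameter j of model i (hypothetical flat layout)
    fishers[i][j] : diagonal Fisher estimate for that parameter
    coeffs[i]     : scalar coefficient for model i (fixed here; the
                    abstract tunes these with Bayesian optimization)
    """
    n_models = len(params)
    n_params = len(params[0])
    merged = []
    for j in range(n_params):
        # Weighted sum of parameters, weight = coefficient * Fisher value.
        num = sum(coeffs[i] * fishers[i][j] * params[i][j] for i in range(n_models))
        den = sum(coeffs[i] * fishers[i][j] for i in range(n_models))
        if den > 0:
            merged.append(num / den)
        else:
            # No importance signal at all: fall back to a simple mean.
            merged.append(sum(p[j] for p in params) / n_models)
    return merged


# Two toy "models" with two parameters each.
merged = fisher_weighted_merge(
    params=[[1.0, 0.0], [3.0, 4.0]],
    fishers=[[1.0, 1.0], [1.0, 3.0]],
    coeffs=[1.0, 1.0],
)
```

With equal Fisher values the first parameter reduces to an ordinary average, while the second is pulled toward the model whose Fisher estimate is larger.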

Authors:Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
Title: DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models
Abstract:
Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.
中文: DistilQwen2.5是基于Qwen2.5开发的轻量化蒸馏模型系列,通过多智能体教师模型和模型融合技术,在降低计算成本的同时显著提升了指令遵循能力。
English: DistilQwen2.5 is a family of distilled lightweight LLMs derived from Qwen2.5, utilizing multi-agent teacher models and model fusion techniques to significantly enhance instruction-following capabilities while reducing computational costs.

Authors:Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara
Title: Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation
Abstract:
In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.
中文摘要:本文提出Fashion-RAG创新方法,通过检索匹配文本描述的服装并利用文本反转技术将其属性融入生成图像,实现了基于文字描述的时尚定制功能。
English Summary: This paper introduces Fashion-RAG, a novel method that enhances virtual fashion customization by retrieving garments matching textual descriptions and integrating their attributes into AI-generated images using textual inversion techniques.

Authors:Bingyan Liu, Chengyu Wang, Tongtong Su, Huan Ten, Jun Huang, Kailing Guo, Kui Jia
Title: Understanding Attention Mechanism in Video Diffusion Models
Abstract:
Text-to-video (T2V) synthesis models, such as OpenAI's Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video's intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. The efficacy and effectiveness of our methods are further validated through experimental evaluation across multiple datasets.
中文: 本研究通过信息论的扰动分析发现,文本到视频模型中高熵注意力图与优质视频生成密切相关,并基于此提出了仅需轻量级注意力操作即可提升视频质量和实现文本引导编辑的创新方法。
English: This study conducts an information-theoretic perturbation analysis of spatial and temporal attention blocks in text-to-video models, revealing that high-entropy attention maps correlate with superior video quality while enabling novel methods for quality enhancement and text-guided editing through lightweight attention manipulation.
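
The abstract above distinguishes high-entropy from low-entropy attention maps. As a minimal sketch of what that quantity usually means, the snippet below computes the mean Shannon entropy over the rows of an attention map, assuming each row is a probability distribution over keys. The function name and the nested-list representation are assumptions for illustration; the paper's exact entropy definition and perturbation procedure are not given in the abstract.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over the rows of an attention map.

    attn_rows is a list of rows; each row is assumed to be a probability
    distribution over keys (non-negative entries summing to 1).
    """
    total = 0.0
    for row in attn_rows:
        # Zero entries contribute nothing (the limit of p*log p as p -> 0 is 0).
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


uniform = attention_entropy([[0.25, 0.25, 0.25, 0.25]])  # maximally spread
peaked = attention_entropy([[1.0, 0.0, 0.0, 0.0]])       # fully concentrated
```

A uniform row over four keys gives the maximum value, log 4, and a one-hot row gives zero, so the measure separates diffuse attention from sharply focused attention.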

Authors:Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Title: Training Small Reasoning LLMs with Cognitive Preference Alignment
Abstract:
The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can sometimes be ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperform other training methods by a large margin.
中文:CRV框架和CogPO算法通过使思维过程与认知能力对齐,成功训练出参数更少但性能强大的推理大语言模型,在基准测试中大幅领先其他方法。
English: The CRV framework and CogPO algorithm enable training smaller reasoning LLMs by aligning thought processes with their cognitive capacities, significantly outperforming other methods on benchmarks.

Authors:Chengyu Wang, Taolin Zhang, Richang Hong, Jun Huang
Title: A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions
Abstract:
Recently, the reasoning capabilities of large reasoning models (LRMs), such as DeepSeek-R1, have seen significant advancements through the slow thinking process. Despite these achievements, the substantial computational demands of LRMs present considerable challenges. In contrast, small reasoning models (SRMs), often distilled from larger ones, offer greater efficiency and can exhibit distinct capabilities and cognitive trajectories compared to LRMs. This work surveys around 170 recently published papers on SRMs for tackling various complex reasoning tasks. We review the current landscape of SRMs and analyze diverse training and inference techniques related to SRMs. Furthermore, we provide a comprehensive review of SRMs for domain-specific applications and discuss possible future research directions. This survey serves as an essential reference for researchers to leverage or develop SRMs for advanced reasoning functionalities with high efficiency.
中文: 大型推理模型虽取得显著进展但计算成本高昂,而小型推理模型则具备更高效率与独特能力,本文通过综述170篇文献系统探讨了其训练方法、领域应用及未来研究方向。
English: Large reasoning models have advanced significantly but face high computational costs, while small reasoning models offer greater efficiency and unique capabilities, as surveyed in this review of around 170 papers covering training techniques, domain applications, and future directions.

Authors:Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang
Title: NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Abstract:
Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models' underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce NeedleInATable (NIAT), a new long-context tabular benchmark that treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that they may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding of structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both the NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs' genuine table understanding ability. Our data, code and models will be released to facilitate future research.
中文摘要:本研究提出了NeedleInATable(NIAT)基准测试,旨在评估大语言模型对长结构化表格中单个单元格的细粒度感知能力,揭示了模型依赖捷径而非真正理解的性能差距,并证明使用NIAT训练数据能有效提升该能力及下游表格任务表现。
English Summary: The study introduces NeedleInATable (NIAT), a benchmark designed to assess large language models' fine-grained perception of individual cells in long structured tables, revealing a performance gap that suggests reliance on shortcuts rather than genuine understanding, and demonstrates that training with NIAT data improves both this capability and downstream table tasks.
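The cell-as-needle idea can be sketched as follows. This is a minimal illustration under stated assumptions: the pipe-separated serialization, the question wording, and the helper names (`serialize_table`, `cell_location_probe`, `exact_match`) are hypothetical and do not reproduce the benchmark's actual prompt format or scoring.

```python
def serialize_table(header, rows):
    """Flatten a table into pipe-separated text lines (assumed format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def cell_location_probe(header, rows, row_idx, col_idx):
    """Build one location-based probe whose answer is a single target cell."""
    table = serialize_table(header, rows)
    question = (
        f"What is the value in data row {row_idx + 1}, "
        f"column {col_idx + 1}?"
    )
    return {
        "prompt": table + "\n\n" + question,
        "answer": str(rows[row_idx][col_idx]),
    }

def exact_match(prediction, answer):
    """Score a model prediction by whitespace-insensitive exact match."""
    return prediction.strip() == answer.strip()

# Toy table (assumed data) with the needle at data row 2, column 2.
header = ["city", "population_k"]
rows = [["Oslo", 700], ["Bergen", 290]]
probe = cell_location_probe(header, rows, row_idx=1, col_idx=1)
print(probe["prompt"])
```

Every cell of a table can be turned into a probe this way, so accuracy can be broken down by cell position in a long table.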

Authors:Yong Ren, Jiangyan Yi, Tao Wang, Jianhua Tao, Zheng Lian, Zhengqi Wen, Chenxing Li, Ruibo Fu, Ye Bai, Xiaohui Zhang
Title: P2Mark: Plug-and-play Parameter-level Watermarking for Neural Speech Generation
Abstract:
Neural speech generation (NSG) has rapidly advanced as a key component of artificial intelligence-generated content, enabling the generation of high-quality, highly realistic speech for diverse applications. This development increases the risk of technique misuse and threatens social security. Audio watermarking can embed imperceptible marks into generated audio, providing a promising approach for secure NSG usage. However, current audio watermarking methods are mainly applied at the audio-level or feature-level, which are not suitable for open-sourced scenarios where source codes and model weights are released. To address this limitation, we propose a Plug-and-play Parameter-level WaterMarking (P2Mark) method for NSG. Specifically, we embed watermarks into the released model weights, offering a reliable solution for proactively tracing and protecting model copyrights in open-source scenarios. During training, we introduce a lightweight watermark adapter into the pre-trained model, allowing watermark information to be merged into the model via this adapter. This design ensures both the flexibility to modify the watermark before model release and the security of embedding the watermark within model parameters after model release. Meanwhile, we propose a gradient orthogonal projection optimization strategy to ensure the quality of the generated audio and the accuracy of watermark preservation. Experimental results on two mainstream waveform decoders in NSG (i.e., vocoder and codec) demonstrate that P2Mark achieves comparable performance to state-of-the-art audio watermarking methods that are not applicable to open-source white-box protection scenarios, in terms of watermark extraction accuracy, watermark imperceptibility, and robustness.
中文: 提出的即插即用参数级水印方法(P2Mark)将不可感知的水印直接嵌入神经语音生成模型参数中,在开源场景下实现安全的版权保护,同时保持音频质量和水印鲁棒性。
English: The proposed Plug-and-play Parameter-level WaterMarking (P2Mark) method embeds imperceptible watermarks directly into neural speech generation model weights, enabling secure copyright protection in open-source scenarios while maintaining audio quality and watermark robustness.

Authors:Wang Wei, Tiankai Yang, Hongjie Chen, Ryan A. Rossi, Yue Zhao, Franck Dernoncourt, Hoda Eldardiry
Title: Efficient Model Selection for Time Series Forecasting via LLMs
Abstract:
Model selection is a critical step in time series forecasting, traditionally requiring extensive performance evaluations across various datasets. Meta-learning approaches aim to automate this process, but they typically depend on pre-constructed performance matrices, which are costly to build. In this work, we propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection. Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs. Through extensive experiments with LLaMA, GPT and Gemini, we demonstrate that our approach outperforms traditional meta-learning techniques and heuristic baselines, while significantly reducing computational overhead. These findings underscore the potential of LLMs in efficient model selection for time series forecasting.
中文摘要:本研究提出利用大型语言模型进行时间序列预测模型选择的新方法,通过调用模型内在知识与推理能力替代传统性能矩阵,在超越传统元学习方法的同时显著提升了计算效率。
English Summary: This study introduces a novel approach using Large Language Models (LLMs) for time series forecasting model selection, replacing costly performance matrices with LLMs' built-in knowledge and reasoning to achieve superior performance and efficiency over traditional methods.

Authors:Zhen Tan, Huan Liu
Title: Intrinsic Barriers to Explaining Deep Foundation Models
Abstract:
Deep Foundation Models (DFMs) offer unprecedented capabilities, but their increasing complexity presents profound challenges to understanding their internal workings, a critical need for ensuring trust, safety, and accountability. As we grapple with explaining these systems, a fundamental question emerges: Are the difficulties we face merely temporary hurdles, awaiting more sophisticated analytical techniques, or do they stem from intrinsic barriers deeply rooted in the nature of these large-scale models themselves? This paper delves into this critical question by examining the fundamental characteristics of DFMs and scrutinizing the limitations encountered by current explainability methods when confronted with this inherent challenge. We probe the feasibility of achieving satisfactory explanations and consider the implications for how we must approach the verification and governance of these powerful technologies.
中文摘要:本文探讨深度基础模型理解困难的根源是暂时性障碍还是内在本质问题,通过分析模型特性与现有解释方法的局限性,评估实现可靠解释的可能性及其对技术验证与治理的影响。
English Summary: This paper asks whether the challenges in understanding Deep Foundation Models are temporary or intrinsic, examining the models' fundamental traits and the limitations of current explainability methods to assess the feasibility of reliable explanations and the implications for verification and governance.

Authors:Yusheng Zhao, Junyu Luo, Xiao Luo, Weizhi Zhang, Zhiping Xiao, Wei Ju, Philip S. Yu, Ming Zhang
Title: Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness
Abstract:
Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
中文: 本研究对多模态大语言模型进行全面评估,发现其具备强大的泛化能力但过度依赖视觉输入且易受对抗样本影响,为未来改进提供了指导方向。
English: This study conducts a comprehensive evaluation of multi-modal large language models, revealing their strong generalization abilities but heavy reliance on visual inputs and susceptibility to adversarial attacks, while offering insights for future improvements.

Authors:Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Title: Learning to Reason under Off-Policy Guidance
Abstract:
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
中文:LUFFY框架通过引入离策略推理轨迹克服了同策略RLVR的局限性,在多个基准测试中取得显著性能提升,并在先前方法完全失败的场景中成功实现了模型训练。
English: The LUFFY framework overcomes the limitations of on-policy RLVR by incorporating off-policy reasoning traces, achieving significant performance gains across multiple benchmarks and enabling successful training where previous methods failed.

Authors:Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturina, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Title: Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
Abstract:
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
中文: NEMOTRON-CROSSTHINK通过在多领域数据中融入结构化模板进行强化学习训练,显著提升了数学与非数学推理任务的准确率,同时将正确答案的响应标记减少28%,实现了更高效通用的语言模型推理能力。
English: NEMOTRON-CROSSTHINK enhances LLM reasoning by integrating multi-domain data and structured templates in RL training, achieving significant accuracy improvements across math and non-math benchmarks while reducing response tokens by 28%.

Authors:Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Title: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Abstract:
Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
中文: 结合注意力机制与状态空间模型的混合架构通过创新的分组感知剪枝策略,在保持SSM结构完整性的同时,将模型压缩至40亿参数,实现推理速度翻倍和精度提升,且训练数据量减少40倍。
English: Hybrid LLM architectures combining Attention and State Space Models can be compressed using a novel group-aware pruning strategy that preserves SSM integrity, achieving a 4B parameter model with 2x faster inference and higher accuracy while requiring 40x fewer training tokens.

Authors:Yuxuan Chen, Shanshan Huang, Yunyao Cheng, Peng Chen, Zhongwen Rao, Yang Shu, Bin Yang, Lujia Pan, Chenjuan Guo
Title: AimTS: Augmented Series and Image Contrastive Learning for Time Series Classification
Abstract:
Time series classification (TSC) is an important task in time series analysis. Existing TSC methods mainly train on each single domain separately, suffering from a degradation in accuracy when the samples for training are insufficient in certain domains. The pre-training and fine-tuning paradigm provides a promising direction for solving this problem. However, time series from different domains are substantially divergent, which challenges the effective pre-training on multi-source data and the generalization ability of pre-trained models. To handle this issue, we introduce Augmented Series and Image Contrastive Learning for Time Series Classification (AimTS), a pre-training framework that learns generalizable representations from multi-source time series data. We propose a two-level prototype-based contrastive learning method to effectively utilize various augmentations in multi-source pre-training, which learns representations for TSC that can be generalized to different domains. In addition, considering augmentations within the single time series modality are insufficient to fully address classification problems with distribution shift, we introduce the image modality to supplement structural information and establish a series-image contrastive learning to improve the generalization of the learned representations for TSC tasks. Extensive experiments show that after multi-source pre-training, AimTS achieves good generalization performance, enabling efficient learning and even few-shot learning on various downstream TSC datasets.
中文摘要:AimTS框架通过双层级原型对比学习和时序-图像跨模态对齐,从多源数据中学习可泛化的表征,有效解决了时序分类中的领域适应问题,并在少样本场景下展现出优越性能。
English Summary: The AimTS framework introduces a novel pre-training approach using two-level prototype-based contrastive learning and cross-modal alignment between time series and images to learn generalizable representations that enable effective few-shot learning across diverse time series classification domains.

Authors:Li Sun, Zhenhao Huang, Yujie Wang, Hongbo Lv, Chunyang Liu, Hao Peng, Philip S. Yu
Title: IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space
Abstract:
Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require a predefined number of clusters K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. These limitations motivate us to study a challenging yet practical problem: deep graph clustering without K, considering the imbalance found in real graphs. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a new Differentiable Structural Information, generalizing the discrete formalism to the continuous realm, so that the optimal partitioning tree, revealing the cluster structure, can be created by gradient backpropagation. Theoretically, we demonstrate its capability in clustering without requiring K and identifying the minority clusters in imbalanced graphs, while reducing the time complexity to O(N) w.r.t. the number of nodes. Subsequently, we present a novel IsoSEL framework for deep graph clustering, where we design a hyperbolic neural network to learn the partitioning tree in the Lorentz model of hyperbolic space, and further conduct Lorentz Tree Contrastive Learning with isometric augmentation. As a result, the partitioning tree incorporates node attributes via mutual information maximization, while the cluster assignment is refined by the proposed tree contrastive learning. Extensive experiments on five benchmark datasets show that IsoSEL outperforms 14 recent baselines by an average of +1.3% in NMI.
中文摘要:本文提出IsoSEL新型深度图聚类框架,通过可微分结构信息和双曲神经网络结合对比学习,无需预设聚类数量即可有效处理不平衡图中的少数类簇识别问题。
English Summary: This paper introduces IsoSEL, a novel deep graph clustering framework that eliminates the need for predefined cluster numbers and effectively handles imbalanced graphs by leveraging differentiable structural information and hyperbolic neural networks with contrastive learning.

Authors:Zhen Tan, Song Wang, Yifan Li, Yu Kong, Jundong Li, Tianlong Chen, Huan Liu
Title: Are We Merely Justifying Results ex Post Facto? Quantifying Explanatory Inversion in Post-Hoc Model Explanations
Abstract:
Post-hoc explanation methods provide interpretation by attributing predictions to input features. Natural explanations are expected to interpret how the inputs lead to the predictions. Thus, a fundamental question arises: Do these explanations unintentionally reverse the natural relationship between inputs and outputs? Specifically, are the explanations rationalizing predictions from the output rather than reflecting the true decision process? To investigate such explanatory inversion, we propose Inversion Quantification (IQ), a framework that quantifies the degree to which explanations rely on outputs and deviate from faithful input-output relationships. Using the framework, we demonstrate on synthetic datasets that widely used methods such as LIME and SHAP are prone to such inversion, particularly in the presence of spurious correlations, across tabular, image, and text domains. Finally, we propose Reproduce-by-Poking (RBP), a simple and model-agnostic enhancement to post-hoc explanation methods that integrates forward perturbation checks. We further show that under the IQ framework, RBP theoretically guarantees the mitigation of explanatory inversion. Empirically, for example, on the synthesized data, RBP can reduce the inversion by 1.8% on average across iconic post-hoc explanation approaches and domains.
中文: 本研究提出反转量化(IQ)框架来评估后解释方法对因果关系的颠倒程度,并设计模型无关的增强方法"主动复现"(RBP)来减少解释反转,实验证明其在合成数据中能平均降低1.8%的反转现象。
English: This study introduces Inversion Quantification (IQ) to measure how post-hoc explanations may reverse input-output causality and proposes Reproduce-by-Poking (RBP) to mitigate this inversion, demonstrating its effectiveness across multiple domains and methods like LIME and SHAP.

Authors:Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, C. Raina MacIntyre, Matthew Scotch, David J Heslop
Title: Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare
Abstract:
Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
中文: 基于大语言模型的生成代理在模拟人类行为(如医疗决策)方面具有潜力,但通过与“理解美国研究”调查数据的对比发现,它们可能引入偏见且无法准确代表真实个体,突显了其在行为研究中的风险与局限性。
English: Generative agents created using large language models show promise for simulating human behavior in studies like healthcare decision-making, but they risk introducing biases and may not accurately reflect real individuals, as demonstrated by comparisons with survey data from the Understanding America Study.

Authors:Yuanhong Yu, Xingyi He, Chen Zhao, Junhao Yu, Jiaqi Yang, Ruizhen Hu, Yujun Shen, Xing Zhu, Xiaowei Zhou, Sida Peng
Title: BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
Abstract:
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.
中文: 本文提出了一种基于RGB的通用物体姿态估计方法,通过将物体边界框角点作为中间表示,结合新型参考点合成器在稀疏视角和遮挡场景中实现稳健性能,在多个基准数据集上展现出优于现有方法的表现。
English: This paper introduces a generalizable RGB-based method for object pose estimation that uses object corner points as an intermediate representation, enabling robust performance in sparse-view and occlusion scenarios through a novel reference-based point synthesizer and demonstrating superior results on benchmark datasets.

Authors:Dingkun Yan, Xinrui Wang, Yusuke Iwasawa, Yutaka Matsuo, Suguru Saito, Jiaxian Guo
Title: ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities
Abstract:
Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the carrier, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.
中文: 本研究提出了一种新颖的工作流程,通过动态调整潜在载体来解决基于参考的线稿上色中的空间和语义错位问题,采用分割交叉注意力和专用编码器等机制,显著提升了上色质量和细节合成效果。
English: This study introduces a novel workflow that dynamically adapts the latent carrier to address spatial and semantic misalignments in reference-based sketch colorization, utilizing mechanisms like split cross-attention and specialized encoders to enhance colorization quality and detail synthesis.

Authors:Zhilin Wang, Yafu Li, Xiaoye Qu, Yu Cheng
Title: SEE: Continual Fine-tuning with Sequential Ensemble of Experts
Abstract:
Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data. Nevertheless, they still suffer inevitable performance loss. Although training separate experts for each task can help prevent forgetting, effectively assembling them remains a challenge. Some approaches use routers to assign tasks to experts, but in continual learning, they often require retraining for optimal performance. To address these challenges, we introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled. The framework employs distributed routing, and during continual fine-tuning, SEE only requires the training of new experts for incoming tasks rather than retraining the entire system. Experiments reveal that SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning. It also demonstrates remarkable generalization ability, as the experts can effectively identify out-of-distribution queries, which can then be directed to a more generalized model for resolution. This work highlights the promising potential of integrating routing and response mechanisms within each expert, paving the way for the future of distributed model ensembling.
中文: 顺序专家集成(SEE)框架通过让每个专家独立处理查询而无需路由器,有效解决了大型语言模型持续微调中的灾难性遗忘问题,其性能超越先前方法,并能识别分布外查询,展现出卓越的泛化能力。
English: The Sequential Ensemble of Experts (SEE) framework effectively addresses catastrophic forgetting in continual fine-tuning of large language models by enabling each expert to independently handle queries without a router, outperforming previous methods and demonstrating strong generalization by identifying out-of-distribution queries.
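
The router-free dispatch described in this abstract can be pictured with a minimal sketch. Everything here is a hypothetical illustration, not the authors' implementation: the `Expert` class, the `accepts` predicate (standing in for each expert's own decision about whether to handle a query), the order in which experts are consulted, and the toy keyword rules are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Expert:
    """Hypothetical expert that both routes and responds."""
    name: str
    accepts: Callable[[str], bool]   # the expert's own "should I handle this?" decision
    respond: Callable[[str], str]    # the expert's answer when it accepts


def answer(query: str, experts: List[Expert], fallback: Callable[[str], str]) -> str:
    """Ask each expert in turn; the first one that accepts the query answers it.

    If no expert accepts, the query is treated as out-of-distribution and is
    passed to a more general fallback model.
    """
    for expert in experts:
        if expert.accepts(query):
            return expert.respond(query)
    return fallback(query)


# Toy experts: the acceptance rules are placeholders for learned behaviour.
math_expert = Expert("math", lambda q: any(ch.isdigit() for ch in q), lambda q: "math:" + q)
code_expert = Expert("code", lambda q: "def " in q, lambda q: "code:" + q)
experts = [math_expert, code_expert]


def general(q: str) -> str:
    return "general:" + q
```

Under these assumptions, supporting a new task only means appending another `Expert` to the list; the existing experts and the dispatch loop stay untouched.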

Authors:Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
Title: From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
Abstract:
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.
中文: 本研究提出一种高效训练方法,可将大语言模型的上下文长度从128K扩展到4M词元,在保持指令遵循和推理能力的同时,在长上下文基准测试中达到最优性能,并在标准任务中保持竞争力。
English: This work introduces an efficient training method to extend LLMs' context length from 128K to 4M tokens while maintaining instruction-following and reasoning capabilities, achieving state-of-the-art performance on long-context benchmarks and competitive results on standard tasks.

Authors:Xinpeng Ding, Kui Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Xiaomeng Li
Title: PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Abstract:
Direct Preference Optimization (DPO) helps reduce hallucinations in Video Multimodal Large Language Models (VLLMs), but its reliance on offline preference data limits adaptability and fails to capture true video-response misalignment. We propose Video Direct Preference Optimization (VDPO), an online preference learning framework that eliminates the need for preference annotation by leveraging video augmentations to generate rejected samples while keeping responses fixed. However, selecting effective augmentations is non-trivial, as some clips may be semantically identical to the original under specific prompts, leading to false rejections and disrupting alignment. To address this, we introduce Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), which selects augmentations based on prompt context. Instead of a single rejection, we construct a candidate set of augmented clips and apply a close-to-far selection strategy, first ensuring all clips are semantically relevant and then prioritizing the most prompt-aware distinct clip. This allows the model to better capture meaningful visual differences, mitigating hallucinations while avoiding false rejections and improving alignment. PaMi-VDPO seamlessly integrates into existing VLLMs without additional parameters or GPT-4/human supervision. With only 10k SFT data, it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while maintaining stable performance on general video benchmarks.
English Summary: PaMi-VDPO introduces an online preference learning framework that uses prompt-aware multi-instance selection of video augmentations to reduce hallucinations in VLLMs without needing preference annotations or extra parameters, achieving a 5.3% improvement on the VideoHallucer benchmark.

Authors:Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, Jaehun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Title: Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning
Abstract:
Large reasoning models exhibit remarkable reasoning capabilities via long, elaborate reasoning trajectories. Supervised fine-tuning on such reasoning traces, also known as distillation, can be a cost-effective way to boost reasoning capabilities of student models. However, empirical observations reveal that these reasoning trajectories are often suboptimal, switching excessively between different lines of thought, resulting in under-thinking, over-thinking, and even degenerate responses. We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large reasoning models. Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can then lead to student models with enhanced reasoning capabilities with shorter, thus faster inference. Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed thought traces, and weak-to-strong improvement, where a weaker model revises stronger model's thought traces via Retro-Search. For self-improving, R1-distill-7B, fine-tuned on its own Retro-Search-ed traces, reduces the average reasoning length by 31.2% while improving performance by 7.7% across seven math benchmarks. For weak-to-strong improvement, we retrospectively revise R1-671B's traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data. Our work counters recently emergent viewpoints that question the relevance of search algorithms in the era of large reasoning models, by demonstrating that there are still opportunities for algorithmic advancements, even for frontier models.
中文: 大型推理模型常产生冗长且不理想的推理路径,而Retro-Search算法通过优化这些路径为更短、更高质量的轨迹,从而通过自我改进和弱到强蒸馏机制,显著提升学生模型的推理效率与性能。
English: Large reasoning models can generate suboptimal, lengthy reasoning paths, but the Retro-Search algorithm improves them by refining these paths into shorter, higher-quality traces, thereby enhancing student models' reasoning efficiency and performance through self-improvement and weak-to-strong distillation.

Authors:Alexander Naumann, Xunjiang Gu, Tolga Dimlioglu, Mariusz Bojarski, Alperen Degirmenci, Alexander Popov, Devansh Bisla, Marco Pavone, Urs Müller, Boris Ivanovic
Title: Data Scaling Laws for End-to-End Autonomous Driving
Abstract:
Autonomous vehicle (AV) stacks have traditionally relied on decomposed approaches, with separate modules handling perception, prediction, and planning. However, this design introduces information loss during inter-module communication, increases computational overhead, and can lead to compounding errors. To address these challenges, recent works have proposed architectures that integrate all components into an end-to-end differentiable model, enabling holistic system optimization. This shift emphasizes data engineering over software integration, offering the potential to enhance system performance by simply scaling up training resources. In this work, we evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours with both open-loop metrics and closed-loop simulations. Specifically, we investigate how much additional training data is needed to achieve a target performance gain, e.g., a 5% improvement in motion prediction accuracy. By understanding the relationship between model performance and training dataset size, we aim to provide insights for data-driven decision-making in autonomous driving development.
Chinese: 本研究评估了端到端自动驾驶模型的性能与训练数据量之间的关系,通过开环和闭环测试分析16至8192小时数据集,旨在确定实现目标性能提升所需的数据规模。
English: This study evaluates how training data volume affects the performance of an end-to-end autonomous driving model, aiming to determine the data needed for target performance gains through open- and closed-loop testing on datasets ranging from 16 to 8192 hours.
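
The question of how much extra data buys a target gain can be made concrete with a small sketch. It assumes, purely for illustration, that error follows a power law in training hours, `error = a * hours ** (-b)`, and that a "5% improvement" means a 5% relative error reduction; the abstract states neither. The data points below are synthetic, generated from that formula, and are not results from the paper.

```python
import math


def fit_power_law(hours, errors):
    """Least-squares fit of log(error) = log(a) - b * log(hours); returns (a, b)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def data_multiplier(relative_gain, b):
    """Factor by which the dataset must grow to cut error by `relative_gain`.

    From a * (m * D) ** (-b) = (1 - g) * a * D ** (-b) it follows that
    m = (1 - g) ** (-1 / b).
    """
    return (1.0 - relative_gain) ** (-1.0 / b)


# Synthetic curve (assumed a = 2.0, b = 0.1), used only to exercise the fit.
hours = [16, 64, 256, 1024, 4096]
errors = [2.0 * h ** -0.1 for h in hours]

a, b = fit_power_law(hours, errors)
multiplier = data_multiplier(0.05, b)
```

With a small exponent the multiplier grows quickly, which is why a fitted exponent is useful when deciding whether collecting more driving hours is worth the cost.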

Authors:NVIDIA: Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi, Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger, Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary, Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau, Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, Sanjeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majumdar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri, Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman, Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin, Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zijia Chen
Title: Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Abstract:
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
Chinese: Nemotron-H 推出了混合 Mamba-Transformer 模型,在保持竞争力的准确性的同时显著降低推理成本,推理速度比同类先进模型快达3倍。
English: Nemotron-H introduces hybrid Mamba-Transformer models that significantly reduce inference costs while maintaining competitive accuracy, achieving up to 3x faster inference than comparable state-of-the-art models.
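
The memory argument in this abstract (attention needs a cache that grows with sequence length, while Mamba layers keep a fixed-size state) can be illustrated with back-of-the-envelope arithmetic. The layer counts, head sizes, and state size below are hypothetical placeholders, not Nemotron-H's actual configuration; only the standard key/value cache formula is taken as given.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Key/value cache size: two tensors per attention layer, one entry per token."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(mamba_layers, state_values_per_layer, bytes_per_value=2):
    """Recurrent state size for Mamba-style layers; it does not depend on seq_len."""
    return mamba_layers * state_values_per_layer * bytes_per_value


def hybrid_bytes(attn_layers, mamba_layers, kv_heads, head_dim, seq_len,
                 state_values_per_layer, bytes_per_value=2):
    """Total inference-time cache for a model mixing attention and Mamba layers."""
    return (kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value)
            + ssm_state_bytes(mamba_layers, state_values_per_layer, bytes_per_value))


# Hypothetical 32-layer models: all attention vs. 4 attention + 28 Mamba layers.
SEQ_LEN = 65536
pure = kv_cache_bytes(attn_layers=32, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = hybrid_bytes(attn_layers=4, mamba_layers=28, kv_heads=8, head_dim=128,
                      seq_len=SEQ_LEN, state_values_per_layer=1 << 20)
```

Because only the attention term scales with `seq_len`, the gap between the two totals widens as generations get longer, which is the regime inference-time scaling cares about.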

Authors:Zihan Chen, Song Wang, Zhen Tan, Xingbo Fu, Zhenyu Lei, Peng Wang, Huan Liu, Cong Shen, Jundong Li
Title: A Survey of Scaling in Large Language Model Reasoning
Abstract:
The rapid advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities, driven by various strategies such as multi-agent collaboration. However, unlike the well-established performance improvements achieved through scaling data and model size, the scaling of reasoning in LLMs is more complex and can even negatively impact reasoning performance, introducing new challenges in model alignment and robustness. In this survey, we provide a comprehensive examination of scaling in LLM reasoning, categorizing it into multiple dimensions and analyzing how and to what extent different scaling strategies contribute to improving reasoning capabilities. We begin by exploring scaling in input size, which enables LLMs to process and utilize more extensive context for improved reasoning. Next, we analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We then examine scaling in reasoning rounds, where iterative interactions refine reasoning outcomes. Furthermore, we discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement. Finally, we review applications of scaling across domains and outline future directions for further advancing LLM reasoning. By synthesizing these diverse perspectives, this survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of next-generation AI systems.
中文摘要:本综述全面探讨了在输入规模、推理步骤及交互轮次等多维度上扩展策略如何提升大语言模型的推理能力,同时分析了此类扩展带来的复杂性挑战。
English Summary: This survey comprehensively examines how scaling strategies across multiple dimensions—such as input size, reasoning steps, and iterative rounds—enhance large language models' reasoning capabilities, while addressing the complexities and challenges that arise from such scaling.

Authors:Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
Title: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Abstract:
We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to handle three fundamental capabilities simultaneously: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization, but due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.
中文: ILLUME+ 采用双重视觉标记器和扩散解码器,提升了深度语义理解与高保真图像生成能力,解决了现有统一模型在理解、生成和编辑任务中的局限性。
English: ILLUME+ introduces a dual visual tokenizer and diffusion decoder to enhance deep semantic understanding and high-fidelity image generation, overcoming limitations in unified models for understanding, generation, and editing tasks.

Authors:Xiao Tang, Kexin Zhao, Chao Shen, Qinghe Du, Yichen Wang, Dusit Niyato, Zhu Han
Title: Deep Graph Reinforcement Learning for UAV-Enabled Multi-User Secure Communications
Abstract:
While unmanned aerial vehicles (UAVs) with flexible mobility are envisioned to enhance physical layer security in wireless communications, the efficient security design that adapts to such high network dynamics is rather challenging. The conventional approaches extended from optimization perspectives are usually quite involved, especially when jointly considering factors in different scales such as deployment and transmission in UAV-related scenarios. In this paper, we address the UAV-enabled multi-user secure communications by proposing a deep graph reinforcement learning framework. Specifically, we reinterpret the security beamforming as a graph neural network (GNN) learning task, where mutual interference among users is managed through the message-passing mechanism. Then, the UAV deployment is obtained through soft actor-critic reinforcement learning, where the GNN-based security beamforming is exploited to guide the deployment strategy update. Simulation results demonstrate that the proposed approach achieves near-optimal security performance and significantly enhances the efficiency of strategy determination. Moreover, the deep graph reinforcement learning framework offers a scalable solution, adaptable to various network scenarios and configurations, establishing a robust basis for information security in UAV-enabled communications.
中文: 本文提出了一种深度图强化学习框架,通过图神经网络和强化学习优化无人机部署和安全波束成形,在多用户通信中实现了接近最优的安全性能和更高的策略决策效率。
English: This paper introduces a deep graph reinforcement learning framework that optimizes UAV deployment and security beamforming through graph neural networks and reinforcement learning, achieving near-optimal security performance and enhanced efficiency for multi-user communications.

Authors:Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Title: CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Abstract:
Modern software development demands code that is maintainable, testable, and scalable by organizing the implementation into modular components with iterative reuse of existing code. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow, namely implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises 5,258 problems from Codeforces and is continuously updated via an automated pipeline, which decomposes each problem into subproblems with unit tests based on dependency tree analysis and dataflow analysis. We further propose a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments on 16 popular LLMs reveal significant performance degradation in multi-turn scenarios. For instance, o1-mini retains only 20.8% Pass@1 in the multi-turn scenario versus 37.8% in the single-turn scenario. More fine-grained analysis illustrates that model performance inversely correlates with dependency complexity. These findings not only highlight the critical challenges for supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research.
中文摘要:本研究提出了首个评估大语言模型通过迭代代码复用实现功能能力的基准测试CodeFlowBench,揭示了多轮场景下模型性能显著下降的问题,确立了其作为代码生成研究关键工具的地位。
English Summary: The study introduces CodeFlowBench, the first benchmark to evaluate LLMs' ability to implement functionality through iterative code reuse, revealing significant performance degradation in multi-turn scenarios and establishing it as a vital tool for code generation research.
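
To make the dependency-tree idea concrete, here is a minimal sketch of ordering subproblems so that each function is implemented only after the functions it reuses. The function names and the dependency graph are invented for illustration, and `depth` is a simple stand-in for a structural metric; neither is claimed to match the benchmark's pipeline. The last line is plain arithmetic on the two Pass@1 figures quoted in the abstract.

```python
from graphlib import TopologicalSorter

# Hypothetical problem: each key maps to the helper functions it depends on.
deps = {
    "solve": {"parse_input", "count_pairs"},
    "count_pairs": {"gcd"},
    "parse_input": set(),
    "gcd": set(),
}

# One function per turn, dependencies first.
order = list(TopologicalSorter(deps).static_order())


def depth(name):
    """Length of the longest dependency chain ending at `name` (leaf = 1)."""
    return 1 + max((depth(dep) for dep in deps[name]), default=0)


# Share of o1-mini's single-turn Pass@1 (37.8%) retained in multi-turn (20.8%).
retention = 20.8 / 37.8
```

In a multi-turn evaluation along these lines, a mistake in an early helper such as `gcd` propagates to every later turn that reuses it, which is consistent with the reported drop as dependency complexity grows.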

Authors:Joey Chan, Qiao Jin, Nicholas Wan, Charalampos S. Floudas, Elisabetta Xue, Zhiyong Lu
Title: Recommending Clinical Trials for Online Patient Cases using Artificial Intelligence
Abstract:
Clinical trials are crucial for assessing new treatments; however, recruitment challenges - such as limited awareness, complex eligibility criteria, and referral barriers - hinder their success. With the growth of online platforms, patients increasingly turn to social media and health communities for support, research, and advocacy, expanding recruitment pools and established enrollment pathways. Recognizing this potential, we utilized TrialGPT, a framework that leverages a large language model (LLM) as its backbone, to match 50 online patient cases (collected from published case reports and a social media website) to clinical trials and evaluate performance against traditional keyword-based searches. Our results show that TrialGPT outperforms traditional methods by 46% in identifying eligible trials, with each patient, on average, being eligible for around 7 trials. Additionally, our outreach efforts to case authors and trial organizers regarding these patient-trial matches yielded highly positive feedback, which we present from both perspectives.
中文摘要:TrialGPT框架利用大型语言模型,将在线患者病例与临床试验匹配的表现比传统关键词方法提升46%,且相关方对匹配结果反馈积极。
English Summary: TrialGPT, a framework using a large language model, significantly outperforms traditional keyword-based methods by 46% in matching online patient cases to eligible clinical trials, with positive feedback from outreach efforts.

Authors:Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang
Title: semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
Abstract:
Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.
中文:现有LLM服务系统在解耦设计中存在存储效率问题,因此提出半解耦系统semi-PD,通过解耦计算与统一存储相结合,在高请求率下提升性能并显著降低延迟。
English: Existing LLM serving systems face storage inefficiency in disaggregated designs, prompting the proposal of semi-PD, which employs disaggregated computation and unified storage to enhance performance and reduce latency under high request rates.

Authors:Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, Yu Wang
Title: FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
Abstract:
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency is an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism. FlashOverlap utilizes a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs. Experiments show that such a lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.
Chinese: FlashOverlap系统通过创新的信号触发机制,在保持计算性能的同时实现分块级计算与通信重叠,有效解决了多GPU系统中通信瓶颈问题,最高可获得1.65倍的加速效果。
English: The proposed FlashOverlap system overcomes inter-GPU communication bottlenecks in generative models by enabling interference-free, tile-wise overlapping of computation and communication through a novel signaling mechanism, achieving up to 1.65x speedup.

Authors:Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Jingun Kwon, Hidetaka Kamigaito, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
Title: TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation
Abstract:
Generating images from prompts containing specific entities requires models to retain as much entity-specific knowledge as possible. However, fully memorizing such knowledge is impractical due to the vast number of entities and their continuous emergence. To address this, we propose Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), which augments knowledge on entities included in the prompts and then summarizes the augmented descriptions using Large Language Models (LLMs) to mitigate performance degradation from longer inputs. To evaluate our method, we introduce WiT-Cub (WiT with Captions and Uncomplicated Background-explanations), a dataset comprising captions, images, and an entity list. Experiments on four image generation models and five LLMs show that TextTIGER improves image generation performance in standard metrics (IS, FID, and CLIPScore) compared to caption-only prompts. Additionally, multiple annotators' evaluation confirms that the summarized descriptions are more informative, validating LLMs' ability to generate concise yet rich descriptions. These findings demonstrate that refining prompts with augmented and summarized entity-related descriptions enhances image generation capabilities. The code and dataset will be available upon acceptance.
Chinese: 提出的TextTIGER方法通过增强实体提示的知识并用大语言模型进行总结,相比仅使用标题的方法,提高了图像生成的性能指标并产生更具信息量的描述。
English: The proposed TextTIGER method enhances image generation by enriching entity prompts with augmented knowledge and summarizing them using LLMs, which improves performance metrics and produces more informative descriptions compared to standard caption-only approaches.

Authors:Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
Title: MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Abstract:
The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks (including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
Chinese: MMInference提出了一种动态稀疏注意力方法,可在不修改模型的情况下将长上下文多模态输入的预填充阶段加速高达8.3倍,同时保持准确性。
English: MMInference introduces a dynamic sparse attention method that accelerates the pre-filling stage for long-context multimodal inputs by up to 8.3x while maintaining accuracy, without requiring model modifications.

Authors:Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, Yong Liu
Title: From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
Abstract:
Memory is the process of encoding, storing, and retrieving information, allowing humans to retain experiences, knowledge, skills, and facts over time, and serving as the foundation for growth and effective interaction with the world. It plays a crucial role in shaping our identity, making decisions, learning from past experiences, building relationships, and adapting to changes. In the era of large language models (LLMs), memory refers to the ability of an AI system to retain, recall, and use information from past interactions to improve future responses and interactions. Although previous research and reviews have provided detailed descriptions of memory mechanisms, there is still a lack of a systematic review that summarizes and analyzes the relationship between the memory of LLM-driven AI systems and human memory, as well as how we can be inspired by human memory to construct more powerful memory systems. To achieve this, in this paper, we propose a comprehensive survey on the memory of LLM-driven AI systems. In particular, we first conduct a detailed analysis of the categories of human memory and relate them to the memory of AI systems. Second, we systematically organize existing memory-related work and propose a categorization method based on three dimensions (object, form, and time) and eight quadrants. Finally, we illustrate some open problems regarding the memory of current AI systems and outline possible future directions for memory in the era of large language models.
中文: 本文系统综述了人类记忆与大型语言模型驱动的人工智能系统记忆机制之间的关联,提出了基于三个维度的分类方法,并指出了增强人工智能记忆能力的未来研究方向。
English: This paper presents a systematic survey comparing human memory mechanisms with those of large language model-driven AI systems, proposing a new categorization framework and identifying future research directions to enhance AI memory capabilities.

Authors:Zongyuan Chen, Yan Xia, Jiayuan Liu, Jijia Liu, Wenhao Tang, Jiayu Chen, Feng Gao, Longfei Ma, Hongen Liao, Yu Wang, Chao Yu, Boyu Zhang, Fei Xing
Title: Hysteresis-Aware Neural Network Modeling and Whole-Body Reinforcement Learning Control of Soft Robots
Abstract:
Soft robots exhibit inherent compliance and safety, which makes them particularly suitable for applications requiring direct physical interaction with humans, such as surgical procedures. However, their nonlinear and hysteretic behavior, resulting from the properties of soft materials, presents substantial challenges for accurate modeling and control. In this study, we present a soft robotic system designed for surgical applications and propose a hysteresis-aware whole-body neural network model that accurately captures and predicts the soft robot's whole-body motion, including its hysteretic behavior. Building upon the high-precision dynamic model, we construct a highly parallel simulation environment for soft robot control and apply an on-policy reinforcement learning algorithm to efficiently train whole-body motion control strategies. Based on the trained control policy, we developed a soft robotic system for surgical applications and validated it through phantom-based laser ablation experiments in a physical environment. The results demonstrate that the hysteresis-aware modeling reduces the Mean Squared Error (MSE) by 84.95 percent compared to traditional modeling methods. The deployed control algorithm achieved a trajectory tracking error ranging from 0.126 to 0.250 mm on the real soft robot, highlighting its precision in real-world conditions. The proposed method showed strong performance in phantom-based surgical experiments and demonstrates its potential for complex scenarios, including future real-world clinical applications.
中文: 本研究提出了一种针对外科软体机器人的迟滞感知神经网络模型和强化学习控制方法,在物理验证中将均方误差降低84.95%,并实现了亚毫米级的轨迹跟踪精度。
English: This study introduces a hysteresis-aware neural network model and reinforcement learning control for a surgical soft robot, achieving 84.95% MSE reduction and sub-millimeter tracking precision in physical validations.

Authors:Zhihang Yuan, Rui Xie, Yuzhang Shang, Hanling Zhang, Siyuan Wang, Shengen Yan, Guohao Dai, Yu Wang
Title: VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate
Abstract:
Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup of up to 3x for video generation with minimal quality degradation.
中文: 本文提出VGDFR方法,通过根据运动频率动态调整潜在帧率,以无需训练的方式提升扩散变换器视频生成效率,在保持质量的同时实现高达3倍的加速效果。
English: The paper introduces VGDFR, a training-free method that enhances the efficiency of Diffusion Transformer-based video generation by dynamically adjusting the latent frame rate according to motion frequency, achieving up to 3x speedup with minimal quality loss.
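The idea of spending fewer latent frames on static segments can be illustrated with a toy sketch. This is not the VGDFR scheduler or its merging method: the flattened-list frame representation, the mean-absolute-difference motion measure, the `threshold` parameter, and all function names below are assumptions made only for illustration.

```python
from typing import List

Frame = List[float]  # hypothetical: one latent frame flattened to a list of floats


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Crude motion proxy: mean absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def average(frames: List[Frame]) -> Frame:
    """Element-wise mean of a non-empty run of frames."""
    n = len(frames)
    return [sum(values) / n for values in zip(*frames)]


def merge_static_frames(frames: List[Frame], threshold: float) -> List[Frame]:
    """Greedily average runs of consecutive frames whose change stays below threshold."""
    merged: List[Frame] = []
    run: List[Frame] = []
    for frame in frames:
        # A large jump from the previous frame closes the current low-motion run.
        if run and mean_abs_diff(run[-1], frame) >= threshold:
            merged.append(average(run))
            run = []
        run.append(frame)
    if run:
        merged.append(average(run))
    return merged
```

With four input frames where the first two and the last two are nearly identical, the sketch returns two merged frames, so the low-motion segments are represented by fewer latent elements.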

Authors:Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
Abstract:
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.
中文: 本研究系统优化了自编码器设计并引入创新训练策略,实现了高压缩视频编码、移动端实时解码能力及卓越重建质量。
English: This work systematically optimizes autoencoder design and introduces novel training strategies, achieving high-compression video encoding with real-time decoding on mobile devices and superior reconstruction quality.

Authors:Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
Abstract:
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single VAE network but with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. Latent consistency loss outperforms prior auxiliary losses including LPIPS, GAN and DWT in terms of both quality improvements and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.
中文: 本研究系统优化了自编码器设计并引入创新训练策略,实现了高压缩视频编码、移动端实时解码能力及卓越重建质量。
English: This work systematically optimizes autoencoder design and introduces novel training strategies, achieving high-compression video encoding with real-time decoding on mobile devices and superior reconstruction quality.

Authors:Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li
Title: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Abstract:
With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
中文: 针对现有评估方法难以处理推理模型复杂输出的问题,我们开发了xVerify验证器,它能有效判断答案等价性并在测试中达到95%以上准确率,部分模型甚至超越了GPT-4o的表现。
English: To address the limitations of existing evaluation methods for reasoning models with complex outputs, we introduce xVerify, an efficient answer verifier that demonstrates strong equivalence judgment capabilities and achieves over 95% accuracy in evaluations, even outperforming GPT-4o in some variants.
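The two metrics reported for the verifier, accuracy and F1, can be computed from binary "equivalent / not equivalent" judgments as in the minimal sketch below. It assumes a plain binary F1 with "equivalent" as the positive class; the abstract does not say how the overall F1 is averaged, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(predicted: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary equivalence judgments against gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # judged equivalent, truly equivalent
    fp = sum(1 for p, g in pairs if p and not g)    # judged equivalent, actually not
    fn = sum(1 for p, g in pairs if g and not p)    # missed a truly equivalent answer
    accuracy = sum(1 for p, g in pairs if p == g) / len(pairs)
    # F1 is the harmonic mean of precision and recall; report 0.0 if it is undefined.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1
```

For example, `accuracy_and_f1([True, True, False, False], [True, False, False, True])` has one true positive, one false positive, and one false negative, giving 0.5 for both metrics.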

Authors:Keyan Xu, Dingzirui Wang, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che
Title: Abacus-SQL: A Text-to-SQL System Empowering Cross-Domain and Open-Domain Database Retrieval
Abstract:
Existing text-to-SQL systems have made significant progress in SQL query generation, but they still face numerous challenges. They often lack retrieval capabilities for open-domain databases, requiring users to manually filter relevant databases. Additionally, their cross-domain transferability is limited, making it challenging to accommodate diverse query requirements. To address these issues, we propose Abacus-SQL. Abacus-SQL utilizes database retrieval technology to accurately locate the required databases in an open-domain database environment. It also enhances the system's cross-domain transferability through data augmentation methods. Moreover, Abacus-SQL employs Pre-SQL and Self-debug methods, thereby enhancing the accuracy of SQL queries. Experimental results demonstrate that Abacus-SQL performs strongly on multi-turn text-to-SQL tasks, validating the approach. Abacus-SQL is publicly accessible at https://huozi.8wss.com/abacus-sql/.
中文: Abacus-SQL通过引入开放域数据库检索技术和数据增强方法提升跨域迁移能力,结合Pre-SQL与自我调试机制优化SQL生成准确性,在多轮文本转SQL任务中表现优异,有效解决了现有系统的局限性。
English: Abacus-SQL addresses limitations in existing text-to-SQL systems by incorporating database retrieval for open-domain environments and enhancing cross-domain transferability through data augmentation, while improving SQL accuracy with Pre-SQL and Self-debug methods, demonstrating strong performance in multi-turn tasks.

Authors:Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Title: On The Landscape of Spoken Language Models: A Comprehensive Survey
Abstract:
The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.
中文: 该领域正从特定任务模型转向通用语音语言模型,本文通过分类架构、训练与评估方法,综述了相关研究并指出了未来挑战。
English: The field is transitioning from task-specific models to universal spoken language models (SLMs), which this survey categorizes by architecture, training, and evaluation while outlining future challenges.

Authors:Tingwei Lu, Yangning Li, Liyuan Wang, Binghuai Lin, Jiwei Tang, Qingsong Lv, Wanshi Xu, Hai-Tao Zheng, Yinghui Li, Xin Su, Zifei Shan
Title: LSR-MCTS: Alleviating Long Range Dependency in Code Generation
Abstract:
The emergence of large language models (LLMs) has significantly advanced the code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that the high spikes of the attention scores typically appear at the end of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate them sequentially. Inspired by this, we propose the LSR-MCTS algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms state-of-the-art approaches.
中文: LSR-MCTS算法通过逐行处理代码,结合蒙特卡洛树搜索和自优化机制,有效提升了代码生成质量,在公开基准测试中超越了现有最优方法。
English: The LSR-MCTS algorithm addresses limitations in code generation by processing code line-by-line using Monte Carlo Tree Search and a self-refine mechanism, achieving state-of-the-art results on public benchmarks.

Authors:Yangning Li, Zihua Lan, Lv Qingsong, Yinghui Li, Hai-Tao Zheng
Title: MDIT: A Model-free Data Interpolation Method for Diverse Instruction Tuning
Abstract:
As Large Language Models (LLMs) are increasingly applied across various tasks, instruction tuning has emerged as a critical method for enhancing model performance. However, current data management strategies face substantial challenges in generating diverse and comprehensive data, restricting further improvements in model performance. To address this gap, we propose MDIT, a novel model-free data interpolation method for diverse instruction tuning, which generates varied and high-quality instruction data by performing task interpolation. Moreover, it contains diversity-based clustering strategies to ensure the diversity of the training data. Extensive experiments show that our method achieves superior performance in multiple benchmark tasks. The LLMs finetuned with MDIT show significant improvements in numerous tasks such as general question answering, math reasoning, and code generation. MDIT offers an efficient and automatic data synthetic method, generating diverse instruction data without depending on external resources while expanding the application potential of LLMs in complex environments.
中文: MDIT提出了一种无模型数据插值方法,通过任务插值和聚类策略生成多样化的高质量指令数据,无需外部资源即可显著提升大语言模型在问答、数学推理和代码生成等多任务中的性能。
English: MDIT introduces a model-free data interpolation method that enhances instruction tuning by generating diverse and high-quality data through task interpolation and clustering, significantly improving LLM performance across tasks like question answering and code generation without external resources.

Authors:Lv Qingsong, Yangning Li, Zihua Lan, Zishan Xu, Jiwei Tang, Yinghui Li, Wenhao Jiang, Hai-Tao Zheng, Philip S. Yu
Title: RAISE: Reinforced Adaptive Instruction Selection For Large Language Models
Abstract:
In the instruction fine-tuning of large language models (LLMs), it is widely recognized that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instructions based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. Therefore, we design a dynamic, task-objective-driven instruction selection framework, RAISE (Reinforced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on the expected impact of each instruction on model performance improvement. Our approach is highly interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis demonstrate the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.
中文: RAISE框架通过强化学习在微调过程中动态选择指令,以优化特定任务的性能,仅用1%的训练步骤即可达到优于全数据训练的效果。
English: The RAISE framework dynamically selects instructions during fine-tuning using reinforcement learning to optimize task-specific performance, achieving superior results with only 1% of training steps compared to full-data training.

Authors:Kyoungjun Park, Zhiyuan He, Cheng Luo, Yi Xu, Lili Qiu, Changhan Ge, Muhammad Muaz, Yuqing Yang
Title: Joint Optimization of Handoff and Video Rate in LEO Satellite Networks
Abstract:
Low Earth Orbit (LEO) satellite communication presents a promising solution for delivering Internet access to users in remote regions. Given that video content is expected to dominate network traffic in LEO satellite systems, this study presents a new video-aware mobility management framework specifically designed for such networks. By combining simulation models with real-world datasets, we highlight the critical role of handoff strategies and throughput prediction algorithms in both single-user and multi-user video streaming scenarios. Building on these insights, we introduce a suite of innovative algorithms that jointly determine satellite selection and video bitrate to enhance users' quality of experience (QoE). Initially, we design model predictive control (MPC) and reinforcement learning (RL) based methods for individual users, then extend the approach to manage multiple users sharing a satellite. Notably, we incorporate centralized training with distributed inference in our RL design to develop distributed policies informed by a global view. The effectiveness of our approach is validated through trace-driven simulations and testbed experiments.
中文: 本研究针对低地球轨道卫星网络提出了一种新型视频感知移动管理框架,通过结合卫星选择和视频码率的创新算法,在单用户和多用户场景下有效提升了用户体验质量,并经过仿真与实验验证。
English: This study introduces a novel video-aware mobility management framework for LEO satellite networks, featuring innovative algorithms that optimize satellite selection and video bitrate to enhance user QoE through both single-user and multi-user approaches validated by simulations and experiments.

Authors:Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu
Title: SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Sequence Analysis
Abstract:
Genome sequence analysis, which analyzes the DNA sequences of organisms, drives advances in many critical medical and biotechnological fields. Given its importance and the exponentially growing volumes of genomic sequence data, there are extensive efforts to accelerate genome sequence analysis. In this work, we demonstrate a major bottleneck that greatly limits and diminishes the benefits of state-of-the-art genome sequence analysis accelerators: the data preparation bottleneck, where genomic sequence data is stored in compressed form and needs to be decompressed and formatted first before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic sequence data. The key challenge is to improve data preparation performance while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. We address this challenge by leveraging key properties of genomic datasets to co-design (i) a new (de)compression algorithm, (ii) hardware that decompresses data with lightweight operations and efficient streaming accesses, (iii) storage data layout, and (iv) interface commands to access data. SAGe is highly versatile as it supports datasets from different sequencing technologies and species. Thanks to its lightweight design, SAGe can be seamlessly integrated with a broad range of genome sequence analysis hardware accelerators to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome sequence analysis accelerators by 3.0x-32.1x and 13.0x-34.0x, respectively, compared to when the accelerators rely on state-of-the-art decompression tools.
中文: 本研究揭示了基因组序列分析加速器中数据准备环节的瓶颈,并提出SAGe这一算法-架构协同设计方案,通过轻量级硬件集成在保持高压缩率的同时,显著提升了处理性能与能效。
English: This work identifies the data preparation bottleneck in genome sequence analysis accelerators and introduces SAGe, an algorithm-architecture co-design that enhances performance and energy efficiency while maintaining high compression ratios through lightweight hardware integration.

Authors:Manos Frouzakis, Juan Gómez-Luna, Geraldo F. Oliveira, Mohammad Sadrosadati, Onur Mutlu
Title: PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System
Abstract:
Database Management Systems (DBMSs) are crucial for efficient data management and analytics, and are used in several different application domains. Due to the increasing volume of data a DBMS deals with, current processor-centric architectures (e.g., CPUs, GPUs) suffer from data movement bottlenecks when executing key DBMS operations (e.g., selection, aggregation, ordering, and join). This happens mostly due to the limited memory bandwidth between compute and memory resources. Data-centric architectures like Processing-in-Memory (PIM) are a promising alternative for applications bottlenecked by data, placing compute resources close to where data resides. Previous works have evaluated using PIM for data analytics. However, they either do not use real-world architectures or they consider only a subset of the operators used in analytical queries. This work aims to fully evaluate a data-centric approach to data analytics, by using the real-world UPMEM PIM system. To this end we first present the PIM Data Analytics Library (PIMDAL), which implements four major DB operators: selection, aggregation, ordering and join. Second, we use hardware performance metrics to understand which properties of a PIM system are important for a high-performance implementation. Third, we compare PIMDAL to reference implementations on high-end CPU and GPU systems. Fourth, we use PIMDAL to implement five TPC-H queries to gain insights into analytical queries. We analyze and show how to overcome the three main limitations of the UPMEM system when implementing DB operators: (I) low arithmetic performance, (II) explicit memory management and (III) limited communication between compute units. Our evaluation shows PIMDAL achieves 3.9x the performance of a high-end CPU, on average across the five TPC-H queries.
中文: 本研究提出的PIMDAL内存计算数据分析库突破了硬件限制,在数据库操作中实现了比高端CPU平均3.9倍的性能提升。
English: This study introduces PIMDAL, a Processing-in-Memory data analytics library that overcomes hardware limitations to achieve a 3.9x performance gain over high-end CPUs in database operations.
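The four operators named in the abstract (selection, aggregation, ordering, join) can be sketched over partitioned data to show the general "work locally on each partition, then combine" structure that a data-centric design relies on. This is a plain-Python illustration, not PIMDAL's implementation or the UPMEM SDK: the `(key, value)` row layout, round-robin partitioning, and all function names are assumptions.

```python
from collections import defaultdict
from heapq import merge
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, int]  # hypothetical two-column table: (key, value)


def partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Split rows round-robin into n chunks, standing in for per-unit memory banks."""
    return [rows[i::n] for i in range(n)]


def select(parts: List[List[Row]], pred: Callable[[Row], bool]) -> List[List[Row]]:
    """Selection: each partition filters its own rows independently."""
    return [[r for r in p if pred(r)] for p in parts]


def aggregate_sum(parts: List[List[Row]]) -> Dict[int, int]:
    """Aggregation: per-partition partial sums, then one global combine step."""
    total: Dict[int, int] = defaultdict(int)
    for p in parts:
        local: Dict[int, int] = defaultdict(int)
        for k, v in p:
            local[k] += v
        for k, v in local.items():
            total[k] += v
    return dict(total)


def order(parts: List[List[Row]]) -> List[Row]:
    """Ordering: sort each partition locally, then k-way merge the sorted runs."""
    return list(merge(*(sorted(p) for p in parts)))


def hash_join(left: List[Row], right: List[Row], n: int) -> List[Tuple[int, int, int]]:
    """Join: route rows by key % n so matching keys meet in the same partition."""
    out: List[Tuple[int, int, int]] = []
    for i in range(n):
        build: Dict[int, List[int]] = defaultdict(list)
        for k, v in left:
            if k % n == i:
                build[k].append(v)
        for k, w in right:
            if k % n == i:
                out.extend((k, v, w) for v in build.get(k, []))
    return out
```

In this sketch only the combine steps (the global sum, the k-way merge, and the key-based routing in the join) need data from more than one partition, which is where the limited communication between compute units mentioned in the abstract would matter.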

Authors:Asier Bikandi, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-Lopez
Title: BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
Abstract:
Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
中文: 本文提出了一种BIM感知的漂移校正方法,通过优化技术将现实环境检测到的平面与BIM中的建筑平面对齐,显著降低了施工环境中AR可视化的漂移误差。
English: This paper introduces a BIM-aware drift correction method that aligns real-world detected planes with BIM's architectural planes through optimization, significantly reducing drift errors and improving AR visualization accuracy in construction environments.
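The step of computing a transformation between the SLAM and BIM frames from matched planes can be illustrated, for the rotation part only, with a closed-form 2D least-squares fit. This is a simplified sketch and not the paper's optimization: it assumes vertical walls whose horizontal unit normals have already been matched one-to-one, it ignores translation, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]  # horizontal unit normal of a wall plane


def rotate(v: Vec2, yaw: float) -> Vec2:
    """Rotate a 2D vector counter-clockwise by yaw radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def estimate_yaw(slam_normals: List[Vec2], bim_normals: List[Vec2]) -> float:
    """Yaw that best rotates SLAM-frame normals onto their matched BIM-frame normals.

    Minimizes sum ||R(yaw) s_i - b_i||^2, whose closed-form solution is
    atan2(sum of 2D cross products, sum of dot products).
    """
    if len(slam_normals) != len(bim_normals) or not slam_normals:
        raise ValueError("need the same non-zero number of matched normals")
    cross_sum = sum(sx * by - sy * bx
                    for (sx, sy), (bx, by) in zip(slam_normals, bim_normals))
    dot_sum = sum(sx * bx + sy * by
                  for (sx, sy), (bx, by) in zip(slam_normals, bim_normals))
    return math.atan2(cross_sum, dot_sum)
```

Applying `rotate` with the estimated yaw to the SLAM-frame normals brings them back onto the BIM normals, which is the angular part of the drift correction; a full solution would also estimate translation from the plane offsets.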

Authors:Asier Bikandi-Noya, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-Lopez
Title: BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
Abstract:
Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
中文: 本文提出了一种BIM感知的漂移校正方法,通过优化技术将现实环境检测到的平面与BIM中的建筑平面对齐,显著降低了施工环境中AR可视化的漂移误差。
English: This paper introduces a BIM-aware drift correction method that aligns real-world detected planes with BIM's architectural planes through optimization, significantly reducing drift errors and improving AR visualization accuracy in construction environments.

Authors:Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Kai Wang, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Xiaolong Jin, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Liang Pang, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu
Title: A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
Abstract:
The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full-stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to existing LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment, and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.
中文: 本文首次提出"全栈"安全概念,系统性地覆盖大语言模型从训练到商业化的全生命周期安全问题,通过全面文献综述提供了独特的研究方向指导。
English: This paper introduces a "full-stack" safety framework to address security concerns throughout the entire lifecycle of Large Language Models, offering comprehensive coverage, extensive literature support, and unique research directions.

Authors:Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: A Survey on (M)LLM-Based GUI Agents
Abstract:
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.
中文摘要:本综述系统分析了基于大语言模型的图形用户界面智能体,剖析其感知、探索、规划与交互四大核心组件,揭示人工智能技术如何推动界面自动化革新,并探讨当前挑战与未来发展方向。
English Summary: This survey comprehensively examines LLM-based GUI Agents, analyzing their core components—perception, exploration, planning, and interaction—and highlighting how advances in AI have revolutionized interface automation while addressing current challenges and future directions.

Authors:Zhengyu Wu, Boyang Pang, Xunkai Li, Yinlin Zhu, Daohan Su, Bowen Fan, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Title: Towards Unbiased Federated Graph Learning: Label and Topology Perspectives
Abstract:
Federated Graph Learning (FGL) enables privacy-preserving, distributed training of graph neural networks without sharing raw data. Among its approaches, subgraph-FL has become the dominant paradigm, with most work focused on improving overall node classification accuracy. However, these methods often overlook fairness due to the complexity of node features, labels, and graph structures. In particular, they perform poorly on nodes with disadvantaged properties, such as being in the minority class within subgraphs or having heterophilous connections (neighbors with dissimilar labels or misleading features). This reveals a critical issue: high accuracy can mask degraded performance on structurally or semantically marginalized nodes. To address this, we advocate for two fairness goals: (1) improving representation of minority class nodes for class-wise fairness and (2) mitigating topological bias from heterophilous connections for topology-aware fairness. We propose FairFGL, a novel framework that enhances fairness through fine-grained graph mining and collaborative learning. On the client side, the History-Preserving Module prevents overfitting to dominant local classes, while the Majority Alignment Module refines representations of heterophilous majority-class nodes. The Gradient Modification Module transfers minority-class knowledge from structurally favorable clients to improve fairness. On the server side, FairFGL uploads only the most influenced subset of parameters to reduce communication costs and better reflect local distributions. A cluster-based aggregation strategy reconciles conflicting updates and curbs global majority dominance. Extensive evaluations on eight benchmarks show FairFGL significantly improves minority-group performance, achieving up to a 22.62 percent Macro-F1 gain while enhancing convergence over state-of-the-art baselines.
中文摘要:联邦图学习常为追求准确性而忽视公平性,而FairFGL框架通过细粒度图挖掘和协同学习,提升少数类节点表征并缓解拓扑偏差,显著改善了边缘节点的性能表现。
English Summary: Federated Graph Learning often sacrifices fairness for accuracy, but FairFGL addresses this by enhancing minority node representation and mitigating topological bias through fine-grained graph mining and collaborative learning, achieving significant performance gains.

Authors:Mert Asim Karaoglu, Wenbo Ji, Ahmed Abbas, Nassir Navab, Benjamin Busam, Alexander Ladikos
Title: LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking
Abstract:
Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performance fails to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements, running around 7x faster than its predecessor and 2x faster than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room.
Chinese: LiteTracker是一种用于内窥镜视频的低延迟组织追踪方法,其运行速度比前代快7倍,同时保持高精度和遮挡预测能力,适用于实时手术应用。
English: LiteTracker is a low-latency method for tissue tracking in endoscopic videos that achieves 7x faster runtime than its predecessor while maintaining high accuracy and occlusion prediction, making it suitable for real-time surgical applications.

Authors:Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal
Title: Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems
Abstract:
Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems by developing EFAGen, which operationalizes the task of automatically inferring an EFA for a given seed problem and solution as a program synthesis task. We first formalize the properties of any valid EFA as executable unit tests. Using execution feedback from the unit tests, we search over candidate programs sampled from an LLM to find EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. We then apply the tests as a reward signal, training LLMs to become better writers of EFAs. We show that EFAs inferred by EFAGen are faithful to the seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across diverse sources of competition-level math problems. Finally, we show uses of model-written EFAs, e.g., finding harder/easier problem variants, as well as data generation.
中文: 科学家开发了EFAGen系统,能够自动为高等数学问题生成可执行功能抽象(EFA),从而创建多样化的问题变体,并应用于数据生成和难度调整等场景。
English: Scientists develop EFAGen to automatically create Executable Functional Abstractions (EFAs) for advanced math problems, enabling the generation of diverse problem variations and applications such as data generation and difficulty adjustment.

Authors:Zhengyu Wu, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Title: Federated Prototype Graph Learning
Abstract:
In recent years, Federated Graph Learning (FGL) has gained significant attention for its distributed training capabilities in graph-based machine intelligence applications, mitigating data silos while offering a new perspective for privacy-preserving large-scale graph learning. However, multi-level FGL heterogeneity presents various client-server collaboration challenges: (1) Model-level: The variation in clients for expected performance and scalability necessitates the deployment of heterogeneous models. Unfortunately, most FGL methods rigidly demand identical client models due to the direct model weight aggregation on the server. (2) Data-level: The intricate nature of graphs, marked by the entanglement of node profiles and topology, poses an optimization dilemma. This implies that models obtained by federated training struggle to achieve superior performance. (3) Communication-level: Some FGL methods attempt to increase message sharing among clients or between clients and the server to improve training, which inevitably leads to high communication costs. In this paper, we propose FedPG as a general prototype-guided optimization method for the above multi-level FGL heterogeneity. Specifically, on the client side, we integrate multi-level topology-aware prototypes to capture local graph semantics. Subsequently, on the server side, leveraging the uploaded prototypes, we employ topology-guided contrastive learning and personalized technology to tailor global prototypes for each client, broadcasting them to improve local training. Experiments demonstrate that FedPG outperforms SOTA baselines by an average of 3.57% in accuracy while reducing communication costs by 168x.
中文: 联邦图学习面临模型、数据和通信层面的异构性挑战,而FedPG通过原型引导优化方法有效提升精度并大幅降低通信开销。
English: Federated Graph Learning (FGL) faces challenges from multi-level heterogeneity, including model, data, and communication issues, which FedPG addresses by using prototype-guided optimization to enhance accuracy and drastically reduce communication costs.

Authors:Claudio Cimarelli, Jose Andres Millan-Romera, Holger Voos, Jose Luis Sanchez-Lopez
Title: Hardware, Algorithms, and Applications of the Neuromorphic Vision Sensor: a Review
Abstract:
Neuromorphic, or event, cameras represent a transformation in the classical approach to visual sensing: they encode detected instantaneous per-pixel illumination changes into an asynchronous stream of event packets. Their novelty compared to standard cameras lies in the transition from capturing full picture frames at fixed time intervals to a sparse data format which, with its distinctive qualities, offers potential improvements in various applications. However, these advantages come at the cost of reinventing algorithmic procedures or adapting them to effectively process the new data format. In this survey, we systematically examine neuromorphic vision along three main dimensions. First, we highlight the technological evolution and distinctive hardware features of neuromorphic cameras from their inception to recent models. Second, we review image processing algorithms developed explicitly for event-based data, covering key works on feature detection, tracking, and optical flow, which form the basis for analyzing image elements and transformations, as well as depth and pose estimation or object recognition, which interpret more complex scene structures and components. These techniques, drawn from classical computer vision and modern data-driven approaches, are examined to illustrate the breadth of applications for event-based cameras. Third, we present practical application case studies demonstrating how event cameras have been successfully used across various industries and scenarios. Finally, we analyze the challenges limiting widespread adoption, identify significant research gaps compared to standard imaging techniques, and outline promising future directions and opportunities that neuromorphic vision offers.
中文: 神经形态相机通过捕获稀疏的异步事件数据而非完整帧来革新视觉感知,虽具应用优势但需新算法支持;本综述从硬件演进、专用处理技术到实际应用系统性地探讨了该领域,同时分析了当前挑战与未来方向。
English: Neuromorphic cameras revolutionize visual sensing by capturing sparse, asynchronous event data instead of full frames, offering application advantages but requiring new algorithmic approaches, as this survey explores through hardware evolution, specialized processing techniques, and real-world implementations while addressing adoption challenges.

Authors:Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
Title: VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Abstract:
Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
中文: 本文采用GRPO强化微调方法开发了VideoChat-R1视频多模态大模型,在保持对话能力的同时实现了时空感知任务的性能突破,并展现出新兴的时空推理能力。
English: This paper introduces Reinforcement Fine-Tuning (RFT) with GRPO to develop VideoChat-R1, a video MLLM that achieves state-of-the-art spatio-temporal perception while maintaining chat capabilities and showing emerging reasoning abilities.

Authors:Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Title: OmniSVG: A Unified Scalable Vector Graphics Generation Model
Abstract:
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons of over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
中文摘要:OmniSVG是一种创新框架,利用预训练的视觉语言模型高效生成高质量复杂SVG,并基于大规模多模态数据集,在性能上超越现有方法。
English Summary: OmniSVG is a novel framework that uses pre-trained vision-language models to generate high-quality, complex SVGs efficiently, supported by a large multimodal dataset and outperforming existing methods.

Authors:Zihao Zhang, Xunkai Li, Rong-Hua Li, Bing Zhou, Zhenjun Li, Guoren Wang
Title: Toward General and Robust LLM-enhanced Text-attributed Graph Learning
Abstract:
Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from text and edge sparsity, leading to suboptimal performance. To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified, comprehensive, and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12% and 17.47% in ideal and sparse settings, respectively. Moreover, as the data sparsity ratio increases, the performance improvement of UltraTAG-S also rises, which underscores the effectiveness and robustness of UltraTAG-S.
中文摘要:提出的UltraTAG框架通过统一架构解决了LLM增强的文本属性图学习中的挑战,其实现方案UltraTAG-S采用基于大语言的文本传播增强和边重构策略,有效缓解现实场景中的稀疏性问题,实验证明其性能显著优于现有方法。
English Summary: The proposed UltraTAG framework addresses challenges in LLM-enhanced text-attributed graph learning by providing a unified structure and introducing UltraTAG-S, which effectively mitigates text and edge sparsity through innovative techniques, demonstrating significant performance improvements in experiments.

Authors:Abdelrahman Elskhawy, Mengze Li, Nassir Navab, Benjamin Busam
Title: PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
Abstract:
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and predicates connecting them. This facilitates image-based understanding and reasoning for various downstream tasks. Although fully supervised SGG approaches have shown steady performance improvements, they suffer from a severe training bias. This bias is caused by the availability of only small subsets of curated data and manifests as a long-tailed predicate distribution whose lack of predicate diversity adversely affects downstream tasks. To overcome this, we introduce PRISM-0, a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach to capture the whole spectrum of diverse, open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions. These are used to prompt an LLM to generate fine- and coarse-grained predicates for the pair. The predicates are then validated using a VQA model to provide a final SGG. With the modular and dataset-independent PRISM-0, we can enrich existing SG datasets such as Visual Genome (VG). Experiments illustrate that PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval with performance on par with the best fully supervised methods.
中文摘要:PRISM-0框架通过自底向上整合基础模型,利用视觉语言模型生成描述并借助大语言模型创建细粒度谓词,有效解决了场景图生成中的训练偏差和谓词多样性不足问题,其生成的语义丰富图谱在多项下游任务中达到与全监督方法相当的性能。
English Summary: The PRISM-0 framework addresses training bias and limited predicate diversity in Scene Graph Generation by leveraging foundation models to create open-vocabulary predicates through a multi-step process involving vision-language models and LLMs, achieving performance comparable to supervised methods while enhancing downstream applications.

Authors:Enjun Du, Xunkai Li, Tian Jin, Zhihan Zhang, Rong-Hua Li, Guoren Wang
Title: GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments
Abstract:
The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large-scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes: a critical limitation for real-world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce GraphMaster, the first multi-agent framework specifically designed for graph data synthesis in data-limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data-limited "Sub" variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold-based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data-scarce environments.
中文摘要:GraphMaster提出了首个多智能体框架,通过四个专业LLM代理的协同优化,在数据受限环境下实现了语义连贯与结构完整的图数据合成,显著超越了传统方法并推动了图基础模型的发展。
English Summary: GraphMaster introduces a multi-agent LLM framework that overcomes data scarcity and semantic limitations in graph synthesis by ensuring structural integrity and semantic coherence through iterative refinement, significantly outperforming traditional methods.

Authors:Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Yongkang Huang, Yihan Shi, Xikun Zhang, Libiao Peng, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Title: Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues
Abstract:
Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual's negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from an LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.
中文摘要:本研究提出CRDial框架,通过多轮对话的专门阶段和策略改进认知重构,并利用生成的双语数据集Crisp训练出高效的心理治疗对话模型。
English Summary: The study introduces CRDial, a framework that enhances Cognitive Restructuring through multi-turn dialogues with specialized stages and strategies, and develops Crisp, a bilingual dataset used to train effective conversational models for psychotherapy.

Authors:Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Hengle Ren, Renjing Xu, Jian Tang
Title: RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots
Abstract:
3D occupancy prediction enables robots to obtain fine-grained spatial geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits the network's estimation of complex environments and also limits the description of the scene by 3D Gaussians. In this paper, we propose a 3D occupancy prediction method which enhances the geometric and semantic scene understanding for robots, dubbed RoboOcc. It utilizes the Opacity-guided Self-Encoder (OSE) to alleviate the semantic ambiguity of overlapping Gaussians and the Geometry-aware Cross-Encoder (GCE) to accomplish the fine-grained geometric modeling of the surrounding scene. We conduct extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet datasets, and our RoboOcc achieves state-of-the-art performance in both local and global camera settings. Further, in ablation studies of Gaussian parameters, the proposed RoboOcc outperforms the state-of-the-art methods by large margins of 8.47 and 6.27 in IoU and mIoU, respectively. The code will be released soon.
Chinese: 提出的RoboOcc方法通过引入透明度引导自编码器解决语义模糊性和几何感知交叉编码器实现精细几何建模,显著提升了机器人三维占据预测性能,在基准数据集上达到最优水平。
English: The proposed RoboOcc method enhances 3D occupancy prediction for robots by introducing an Opacity-guided Self-Encoder to resolve semantic ambiguity and a Geometry-aware Cross-Encoder for detailed geometric modeling, achieving state-of-the-art performance on benchmark datasets.

Authors:Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang
Title: CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent
Abstract:
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
中文: 本文提出CheatAgent攻击框架,利用大语言模型的类人推理能力为黑盒推荐系统生成对抗性扰动,并通过大量实验证明其高效性。
English: This paper introduces CheatAgent, a novel attack framework that leverages large language models' human-like reasoning to generate adversarial perturbations for black-box recommender systems, demonstrating high effectiveness through extensive experiments.

Authors:Zeyu Dai, Shengcai Liu, Rui He, Jiahao Wu, Ning Lu, Wenqi Fan, Qing Li, Ke Tang
Title: SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models
Abstract:
Unrestricted adversarial examples (UAEs) allow the attacker to create non-constrained adversarial examples without given clean samples, posing a severe threat to the safety of deep learning models. Recent works utilize diffusion models to generate UAEs. However, these UAEs often lack naturalness and imperceptibility due to simply optimizing in intermediate latent noises. In light of this, we propose SemDiff, a novel unrestricted adversarial attack that explores the semantic latent space of diffusion models for meaningful attributes, and devises a multi-attributes optimization approach to ensure attack success while maintaining the naturalness and imperceptibility of generated UAEs. We perform extensive experiments on four tasks on three high-resolution datasets, including CelebA-HQ, AFHQ and ImageNet. The results demonstrate that SemDiff outperforms state-of-the-art methods in terms of attack success rate and imperceptibility. The generated UAEs are natural and exhibit semantically meaningful changes, in accord with the attributes' weights. In addition, SemDiff is found capable of evading different defenses, which further validates its effectiveness and the threat it poses.
中文:SemDiff提出了一种新颖的无限制对抗攻击方法,通过在扩散模型中优化语义属性来生成自然且不易察觉的对抗样本,在攻击成功率和规避防御能力方面均优于现有最优方法。
English: SemDiff introduces a novel unrestricted adversarial attack that optimizes semantic attributes in diffusion models to generate natural and imperceptible adversarial examples, outperforming state-of-the-art methods in attack success and evasion capabilities.

Authors:Liangbo Ning, Wenqi Fan, Qing Li
Title: Exploring Backdoor Attack and Defense for LLM-empowered Recommendations
Abstract:
The fusion of Large Language Models (LLMs) with recommender systems (RecSys) has dramatically advanced personalized recommendations and drawn extensive attention. Despite the impressive progress, the safety of LLM-based RecSys against backdoor attacks remains largely under-explored. In this paper, we raise a new problem: Can a backdoor with a specific trigger be injected into LLM-based RecSys, leading to the manipulation of the recommendation responses when the backdoor trigger is appended to an item's title? To investigate the vulnerabilities of LLM-based RecSys under backdoor attacks, we propose a new attack framework termed Backdoor Injection Poisoning for RecSys (BadRec). BadRec perturbs the items' titles with triggers and employs several fake users to interact with these items, effectively poisoning the training set and injecting backdoors into LLM-based RecSys. Comprehensive experiments reveal that poisoning just 1% of the training data with adversarial examples is sufficient to successfully implant backdoors, enabling manipulation of recommendations. To further mitigate such a security threat, we propose a universal defense strategy called Poison Scanner (P-Scanner). Specifically, we introduce an LLM-based poison scanner to detect the poisoned items by leveraging the powerful language understanding and rich knowledge of LLMs. A trigger augmentation agent is employed to generate diverse synthetic triggers to guide the poison scanner in learning domain-specific knowledge of the poisoned item detection task. Extensive experiments on three real-world datasets validate the effectiveness of the proposed P-Scanner.
中文: 本研究提出了BadRec后门攻击框架,仅需污染1%的训练数据即可操控基于大语言模型的推荐系统,并设计了P-Scanner防御策略,利用大语言模型的能力有效检测被污染项目。
English: The study introduces BadRec, a backdoor attack framework that can compromise LLM-based recommender systems by poisoning just 1% of training data, and proposes P-Scanner, an LLM-based defense strategy to detect such threats effectively.

Authors:Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen
Title: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Abstract:
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.
中文: 本文提出PALLE新型语音合成系统,采用伪自回归建模融合了自回归模型的时间建模优势和非自回归模型的并行生成能力,在保持高质量语音合成的同时实现了比现有最优系统快十倍的推理速度。
English: This paper introduces PALLE, a novel two-stage text-to-speech system using pseudo-autoregressive modeling that combines the temporal accuracy of autoregressive models with the speed of non-autoregressive approaches, achieving superior speech quality and up to ten times faster inference than current state-of-the-art systems.

Authors:Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu
Title: Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
Abstract:
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has witnessed unprecedented advances in scale and capability in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training process of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
中文: 盘古Ultra是一个拥有1350亿参数的稠密Transformer模型,通过深度缩放三明治归一化技术稳定训练过程,在多项基准测试中达到领先性能,并验证了昇腾NPUs高效训练超大规模模型的能力。
English: Pangu Ultra is a 135-billion-parameter dense Transformer model trained on Ascend NPUs, utilizing depth-scaled sandwich normalization to stabilize training and achieving state-of-the-art performance on benchmarks while demonstrating efficient large-scale training capabilities.

Authors:Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang
Title: Saliency-driven Dynamic Token Pruning for Large Language Models
Abstract:
Despite the recent success of large language models (LLMs), LLMs face particular challenges in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state, which is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65\% of the input tokens, our method reduces FLOPs by 33\% $\sim$ 47\% and achieves a speedup of up to 1.75$\times$ during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.
中文摘要:提出的显著性驱动动态令牌剪枝(SDTP)框架通过重要性评分动态移除大语言模型中的冗余令牌,在保持性能的同时实现了高达47%的FLOPs减少和1.75倍加速。
English Summary: The proposed Saliency-driven Dynamic Token Pruning (SDTP) framework dynamically removes redundant tokens from LLM inputs using importance scoring, achieving up to 47% FLOPs reduction and 1.75× speedup while maintaining performance.
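
The hierarchical pruning idea in this abstract can be illustrated with a small sketch: at each pruning stage, keep only the highest-scoring tokens and preserve their original order. Everything concrete below is an assumption made for illustration: the function names, the use of fixed scores (in SDTP a learned module predicts scores from hidden states at each pruning layer), the number of stages, and the choice of an equal per-stage keep ratio so that roughly 35% of tokens survive overall.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by score, in original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore sequence order
    return [tokens[i] for i in keep], [scores[i] for i in keep]


def hierarchical_prune(tokens, scores, total_keep=0.35, stages=3):
    """Apply the same keep ratio at each stage so about `total_keep` remains.

    Reusing one score vector across stages is a simplification; a real model
    would re-estimate importance at every pruning layer.
    """
    per_stage = total_keep ** (1.0 / stages)
    for _ in range(stages):
        tokens, scores = prune_tokens(tokens, scores, per_stage)
    return tokens, scores


# Toy usage: 100 tokens whose importance grows with position.
toy_tokens = list(range(100))
toy_scores = [i / 100 for i in range(100)]
kept, _ = hierarchical_prune(toy_tokens, toy_scores)
```

Because each stage rounds the kept count down, the final count lands slightly under 35% of the input in this toy run.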

Authors:Zihuai Zhao, Wenqi Fan, Yao Wu, Qing Li
Title: Investigating and Mitigating Stereotype-aware Unfairness in LLM-based Recommendations
Abstract:
Large Language Models (LLMs) have demonstrated unprecedented language understanding and reasoning capabilities to capture diverse user preferences and advance personalized recommendations. Despite the growing interest in LLM-based recommendations, unique challenges are brought to the trustworthiness of LLM-based recommender systems (LLM-RS). Compared to unique user/item representations in conventional recommender systems, users and items share the textual representation (e.g., word embeddings) in LLM-based recommendations. Recent studies have revealed that LLMs are likely to inherit stereotypes that are embedded ubiquitously in word embeddings, due to their training on large-scale uncurated datasets. This leads to LLM-RS exhibiting stereotypical linguistic associations between users and items, causing a form of two-sided (i.e., user-to-item) recommendation fairness. However, there remains a lack of studies investigating the unfairness of LLM-RS due to intrinsic stereotypes, which can simultaneously involve user and item groups. To bridge this gap, this study reveals a new variant of fairness between stereotype groups containing both users and items, to quantify discrimination against stereotypes in LLM-RS. Moreover, in this paper, to mitigate stereotype-aware unfairness in textual user and item representations, we propose a novel framework named Mixture-of-Stereotypes (MoS). In particular, an insightful stereotype-wise routing strategy over multiple stereotype-relevant experts is designed, aiming to learn unbiased representations against different stereotypes in LLM-RS. Extensive experiments are conducted to analyze the influence of stereotype-aware fairness in LLM-RS and the effectiveness of our proposed methods, which consistently outperform competitive benchmarks under various fairness settings.
中文: 大型语言模型提升了个性化推荐能力,但因其从词嵌入中继承的刻板印象而面临可信度挑战,本研究通过提出混合刻板印象框架来缓解这种不公平性,并提升公平性表现。
English: Large language models enhance personalized recommendations but face trustworthiness challenges due to inherited stereotypes from word embeddings, which this study addresses by proposing a Mixture-of-Stereotypes framework to mitigate unfairness and improve fairness performance.

Authors:Liangbo Ning, Wenqi Fan, Qing Li
Title: Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation
Abstract:
Recently, Large Language Model (LLM)-empowered recommender systems have revolutionized personalized recommendation frameworks and attracted extensive attention. Despite the remarkable success, existing LLM-empowered RecSys have been demonstrated to be highly vulnerable to minor perturbations. To mitigate the negative impact of such vulnerabilities, one potential solution is to employ collaborative signals based on item-item co-occurrence to purify the malicious collaborative knowledge from the user's historical interactions inserted by attackers. On the other hand, due to the capabilities to expand insufficient internal knowledge of LLMs, Retrieval-Augmented Generation (RAG) techniques provide unprecedented opportunities to enhance the robustness of LLM-empowered recommender systems by introducing external collaborative knowledge. Therefore, in this paper, we propose a novel framework (RETURN) by retrieving external collaborative signals to purify the poisoned user profiles and enhance the robustness of LLM-empowered RecSys in a plug-and-play manner. Specifically, retrieval-augmented perturbation positioning is proposed to identify potential perturbations within the users' historical sequences by retrieving external knowledge from collaborative item graphs. After that, we further retrieve the collaborative knowledge to cleanse the perturbations by using either deletion or replacement strategies and introduce a robust ensemble recommendation strategy to generate final robust predictions. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed RETURN.
中文:提出的RETURN框架通过检索增强技术,利用外部协同知识识别并净化用户档案中的恶意扰动,从而增强基于大语言模型的推荐系统的鲁棒性。
English: The proposed RETURN framework enhances the robustness of LLM-based recommender systems by using retrieval-augmented techniques to identify and purify malicious perturbations in user profiles through external collaborative knowledge.

Authors:Clark Mingxuan Ju, Leonardo Neves, Bhuvesh Kumar, Liam Collins, Tong Zhao, Yuwei Qiu, Qing Dou, Yang Zhou, Sohail Nizam, Rengim Ozturk, Yvette Liu, Sen Yang, Manish Malik, Neil Shah
Title: Learning Universal User Representations Leveraging Cross-domain User Intent at Snapchat
Abstract:
The development of powerful user representations is a key factor in the success of recommender systems (RecSys). Online platforms employ a range of RecSys techniques to personalize user experience across diverse in-app surfaces. User representations are often learned individually through users' historical interactions within each surface, and user representations across different surfaces can be shared post-hoc as auxiliary features or additional retrieval sources. While effective, such schemes cannot directly encode collaborative filtering signals across different surfaces, hindering their capacity to discover complex relationships between user behaviors and preferences across the whole platform. To bridge this gap at Snapchat, we seek to conduct universal user modeling (UUM) across different in-app surfaces, learning general-purpose user representations which encode behaviors across surfaces. Instead of replacing domain-specific representations, UUM representations capture cross-domain trends, enriching existing representations with complementary information. This work discusses our efforts in developing initial UUM versions, practical challenges, technical choices, and modeling and research directions with promising offline performance. Following successful A/B testing, UUM representations have been launched in production, powering multiple use cases and demonstrating their value. UUM embedding has been incorporated into (i) Long-form Video embedding-based retrieval, leading to 2.78% increase in Long-form Video Open Rate, (ii) Long-form Video L2 ranking, with 19.2% increase in Long-form Video View Time sum, (iii) Lens L2 ranking, leading to 1.76% increase in Lens play time, and (iv) Notification L2 ranking, with 0.87% increase in Notification Open Rate.
Chinese: Snapchat开发了通用用户建模(UUM),通过跨界面学习用户表征来增强现有推荐系统,在A/B测试成功后已在多个功能中实现显著性能提升。
English: Snapchat developed universal user modeling (UUM) to create cross-surface user representations that enhance existing systems, achieving significant performance improvements across multiple features after successful A/B testing.

Authors:Yixin Gao, Xiaohan Pan, Xin Li, Zhibo Chen
Title: Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields
Abstract:
The rapid development of AIGC foundation models has revolutionized the paradigm of image compression, which paves the way for the abandonment of most pixel-level transform and coding, compelling us to ask: why compress what you can generate if the AIGC foundation model is powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than some compact descriptors, i.e., texts, or cues? Fortunately, OpenAI's recent GPT-4o image generation has achieved impressive cross-modality generation, editing, and design capabilities, which motivates us to answer the above question by exploring its potential in image compression fields. In this work, we investigate two typical compression paradigms: textual coding and multimodal coding (i.e., text + extremely low-resolution image), where all/most pixel-level information is generated instead of compressed via the advanced GPT-4o image generation function. The essential challenge lies in how to maintain semantic and structure consistency during the decoding process. To overcome this, we propose a structural raster-scan prompt engineering mechanism to transform the image into textual space, which is compressed as the condition of GPT-4o image generation. Extensive experiments have shown that the combination of our designed structural raster-scan prompts and GPT-4o's image generation function achieves impressive performance compared with recent multimodal/generative image compression at ultra-low bitrate, further indicating the potential of AIGC generation in image compression fields.
中文摘要:本研究利用GPT-4o先进的图像生成功能,通过将图像转换为结构光栅扫描提示来实现图像压缩,在超低码率下相比现有方法取得了更优的性能表现。
English Summary: This study explores the use of GPT-4o's advanced image generation capabilities for image compression by converting images into structural raster-scan prompts, achieving superior performance at ultra-low bitrates compared to existing methods.

Authors:Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, Zhibo Chen
Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: KwaiSR Dataset and Study
Abstract:
In this work, we build the first benchmark dataset for short-form UGC Image Super-resolution in the wild, termed KwaiSR, intending to advance the research on developing image super-resolution algorithms for short-form UGC platforms. This dataset is collected from the Kwai Platform and is composed of two parts, i.e., synthetic and wild parts. Among them, the synthetic dataset, including 1,900 image pairs, is produced by simulating the degradation following the distribution of real-world low-quality short-form UGC images, aiming to provide the ground truth for training and objective comparison in the validation/testing. The wild dataset contains low-quality images collected directly from the Kwai Platform, which are filtered using the quality assessment method KVQ from the Kwai Platform. As a result, the KwaiSR dataset contains 1,800 synthetic image pairs and 1,900 wild images, which are divided into training, validation, and testing parts with a ratio of 8:1:1. Based on the KwaiSR dataset, we organize the NTIRE 2025 challenge on a second short-form UGC Video quality assessment and enhancement, which has attracted many researchers to develop algorithms for it. The results of this competition have revealed that our KwaiSR dataset is quite challenging for existing Image SR methods, which is expected to lead to a new direction in the image super-resolution field. The dataset can be found at https://lixinustc.github.io/NTIRE2025-KVQE-KwaSR-KVQ.github.io/.
中文: 本研究推出了首个短格式用户生成图像超分辨率基准数据集KwaiSR,通过合成与真实数据推动算法发展,并对现有方法构成挑战。
English: This study introduces KwaiSR, the first benchmark dataset for super-resolution of short-form user-generated images, designed to advance algorithm development and challenge existing methods through its synthetic and wild components.

Authors:Jing Zhang, Dan Guo, Zhangbin Li, Meng Wang
Title: EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
Abstract:
This paper focuses on a key challenge in visual emotion understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion, and generates linguistic explanations for it. Despite advances in general segmentation, pixel-level emotion understanding still faces a dual challenge: first, the subjectivity of emotion limits the ability of general segmentation models like SAM to adapt to emotion-oriented segmentation tasks; and second, the abstract nature of art expression makes it hard for captioning models to balance pixel-level semantics and emotion reasoning. To solve the above problems, this paper proposes the Emotion stimuli Segmentation and Explanation Model (EmoSEM) to endow the segmentation framework with emotion comprehension capability. First, to enable the model to perform segmentation well under the guidance of emotional intent, we introduce an emotional prompt with a learnable mask token as the conditional input for segmentation decoding. Then, we design an emotion projector to establish the association between emotion and visual features. Next, more importantly, to address emotion-visual stimuli alignment, we develop a lightweight prefix adapter, a module that fuses the learned emotional mask with the corresponding emotion into a unified representation compatible with the language model. Finally, we input the joint visual, mask, and emotional tokens into the language model and output the emotional explanations. This ensures that the generated interpretations remain semantically and emotionally coherent with the visual stimuli. Our method realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, delivering the first interpretable fine-grained framework for visual emotion analysis. Extensive experiments validate the effectiveness of our model. Code will be made publicly available.
中文: 本文提出EmoSEM模型,通过情感提示和轻量级适配器解决艺术图像中像素级情感理解的双重挑战,实现从像素特征到情感解释的端到端可解释分析。
English: This paper introduces EmoSEM, a model that addresses pixel-level emotion understanding in art images by integrating emotion-guided segmentation and linguistic explanation generation through emotional prompts and a lightweight adapter, achieving end-to-end interpretable emotion analysis.

Authors:Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Title: MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection
Abstract:
Currently, industrial anomaly detection suffers from two bottlenecks: (i) the rarity of real-world defect images and (ii) the opacity of sample quality when synthetic data are used. Existing synthetic strategies (e.g., cut-and-paste) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this paper, we introduce a novel and lightweight pipeline that generates synthetic anomalies through Math-Phys model guidance, refines them via a Coarse-to-Fine approach and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By combining physical modeling of the three most typical physics-driven defect mechanisms: Fracture Line (FL), Pitting Loss (PL), and Plastic Warpage (PW), our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our method, we conduct experiments on three anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, our method achieves state-of-the-art results in both image- and pixel-AUROC, confirming the effectiveness of our MaPhC2F dataset and BiSQAD method. All code will be released.
中文: 本文提出了一种新颖的流程,通过数学物理模型引导和从粗到精的方法生成逼真的合成异常,在多个基准测试中取得了最先进的结果,有效解决了现有合成策略的局限性。
English: This paper introduces a novel pipeline using Math-Phys model guidance and a Coarse-to-Fine approach to generate realistic synthetic anomalies, achieving state-of-the-art results on multiple benchmarks by addressing the limitations of existing synthetic strategies.

Authors:Hamza Pehlivan, Andrea Boscolo Camiletto, Lin Geng Foo, Marc Habermann, Christian Theobalt
Title: Second-order Optimization of Gaussian Splats with Importance Sampling
Abstract:
3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $\sim 6\times$ when the Gaussian count is low while remaining competitive for moderate counts. Project Page: https://vcai.mpi-inf.mpg.de/projects/LM-IS
Chinese: 该方法利用Levenberg-Marquardt和共轭梯度的二阶优化策略,通过稀疏性利用和采样技术,在3D高斯泼溅中实现了比Adam快6倍的训练速度,同时保持竞争力。
English: The proposed second-order optimization method using Levenberg-Marquardt and Conjugate Gradient with sparsity exploitation and sampling strategies achieves up to 6× faster training than Adam while maintaining competitive performance for 3D Gaussian Splatting.
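
The combination of Levenberg-Marquardt with a matrix-free Conjugate Gradient solve, as named in this abstract, can be sketched on a toy problem. Each LM step solves the damped normal equation $(J^\top J + \lambda I)\,\delta = -J^\top r$, and CG only needs products with $J$ and $J^\top$, so the matrix $J^\top J$ is never formed. The line-fitting problem, the fixed damping value, and all function names below are assumptions for illustration. The sketch does not include the paper's sparsity handling, importance sampling, GPU parallelism, or learning-rate heuristic.

```python
def cg_solve(matvec, b, iters=50, tol=1e-12):
    """Conjugate Gradient for A x = b, where A is given only as x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def lm_fit_line(xs, ys, lam=1e-3, steps=10):
    """Fit y = a * x + b with LM steps whose linear solves use matrix-free CG."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        res = [a * x + b - y for x, y in zip(xs, ys)]  # residual vector r
        # Gradient term J^T r; each Jacobian row is [x_i, 1].
        g = [sum(x * r for x, r in zip(xs, res)), sum(res)]

        def matvec(v):
            jv = [x * v[0] + v[1] for x in xs]  # J v
            # J^T (J v) + lam * v, without ever building J^T J.
            return [sum(x * t for x, t in zip(xs, jv)) + lam * v[0],
                    sum(jv) + lam * v[1]]

        delta = cg_solve(matvec, [-g[0], -g[1]])
        a, b = a + delta[0], b + delta[1]
    return a, b


# Toy usage: points on the line y = 2x + 1.
slope, intercept = lm_fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A full LM implementation would also adapt the damping term between steps depending on whether the loss decreased; it is held fixed here to keep the sketch short.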

Authors:Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, Qirui Yang, Fangpu Zhang, Yunlong Lin, Sixiang Chen, Guoxi Huang, Ruirui Lin, Yan Zhang, Jingyu Yang, Huanjing Yue, Jiyuan Chen, Qiaosi Yi, Hongjun Wang, Chenxi Xie, Shuai Li, Yuhui Wu, Kaiyi Ma, Jiakui Hu, Juncheng Li, Liwen Pan, Guangwei Gao, Wenjie Li, Zhenyu Jin, Heng Guo, Zhanyu Ma, Yubo Wang, Jinghua Wang, Wangzhi Xing, Anjusree Karnavar, Diqi Chen, Mohammad Aminul Islam, Hao Yang, Ruikun Zhang, Liyuan Pan, Qianhao Luo, Xin Cao, Han Zhou, Yan Min, Wei Dong, Jun Chen, Taoyi Wu, Weijia Dou, Yu Wang, Shengjie Zhao, Yongcheng Huang, Xingyu Han, Anyan Huang, Hongtao Wu, Hong Wang, Yefeng Zheng, Abhijeet Kumar, Aman Kumar, Marcos V. Conde, Paula Garrido, Daniel Feijoo, Juan C. Benito, Guanglu Dong, Xin Lin, Siyuan Liu, Tianheng Zheng, Jiayu Zhong, Shouyi Wang, Xiangtai Li, Lanqing Guo, Lu Qi, Chao Ren, Shuaibo Wang, Shilong Zhang, Wanyu Zhou, Yunze Wu, Qinzhong Tan, Jieyuan Pei, Zhuoxuan Li, Jiayu Wang, Haoyu Bian, Haoran Sun, Subhajit Paul, Ni Tang, Junhao Huang, Zihan Cheng, Hongyun Zhu, Yuehan Wu, Kaixin Deng, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhizun Luo, Zeyu Xiao, Zhuoyuan Li, Nguyen Pham Hoang Le, An Dinh Thien, Son T. Luu, Kiet Van Nguyen, Ronghua Xu, Xianmin Tian, Weijian Zhou, Jiacheng Zhang, Yuqian Chen, Yihang Duan, Yujie Wu, Suresh Raikwar, Arsh Garg, Kritika, Jianhua Zheng, Xiaoshan Ma, Ruolin Zhao, Yongyu Yang, Yongsheng Liang, Guiming Huang, Qiang Li, Hongbin Zhang, Xiangyu Zheng, A. N. Rajagopalan
Title: NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, and includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. A total of 361 participants registered for the competition, with 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
中文: NTIRE 2025挑战赛通过构建包含昼夜不同聚焦模式的雨滴清晰数据集,为雨滴去除任务设立了新基准,吸引了361名参赛者,其中32支团队在最终测试中取得了顶尖性能。
English: The NTIRE 2025 Challenge introduced a comprehensive Raindrop Clarity dataset to benchmark raindrop removal techniques under diverse lighting and focus conditions, attracting 361 participants with 32 teams achieving state-of-the-art results.

Authors:Vinay Shukla, Prachee Sharma, Ryan Rossi, Sungchul Kim, Tong Yu, Aditya Grover
Title: WaterFlow: Learning Fast & Robust Watermarks using Stable Diffusion
Abstract:
The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and its importance has grown with the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges, such as execution speeds too slow for practical deployment. Other works achieve fast watermarking but suffer greatly in robustness or perceptual quality. In this work, we propose WaterFlow (WF), a fast and extremely robust approach for high-fidelity visual watermarking based on a learned latent-dependent watermark. Our approach utilizes a pretrained latent diffusion model to encode an arbitrary image into a latent space and produces a learned watermark that is then planted into the Fourier domain of the latent. The transformation is specified via invertible flow layers that enhance the expressivity of the latent space of the pretrained model to better preserve image quality while permitting robust and tractable detection. Most notably, WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks. We validate our findings on three widely used real and generated datasets: MS-COCO, DiffusionDB, and WikiArt.
中文:WaterFlow是一种快速且鲁棒的视觉水印方法,通过可逆流层将学习到的潜在相关水印嵌入傅里叶域,在鲁棒性和防御组合攻击方面达到了最先进的性能。
English: WaterFlow is a fast and robust visual watermarking method that uses a learned latent-dependent watermark embedded in the Fourier domain via invertible flow layers, achieving state-of-the-art performance in robustness and defending against combination attacks.

Authors:Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang, Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu, Haobo Yuan, Xiangtai Li, Tao Zhang, Lu Qi, Ming-Hsuan Yang
Title: PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
Abstract:
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
中文总结:本报告概述了在CVPR 2025举办的第四届PVUW挑战赛,重点介绍了MOSE和MeViS两个赛道的成果、参与方法及旨在更好反映真实场景的新数据集,为复杂视频分割领域提供了重要见解。
English Summary: This report summarizes the 4th PVUW Challenge at CVPR 2025, detailing the MOSE and MeViS tracks' outcomes, methodologies, and new datasets that advance complex video segmentation research.

Authors:Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Title: Taming Consistency Distillation for Accelerated Human Image Animation
Abstract:
Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary lightweight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss that concentrates on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.
中文:DanceLCM通过分段一致性蒸馏和运动聚焦损失,在仅需2-4次推理步骤的情况下显著提升人体图像动画质量,大幅降低计算负担的同时保持视频逼真度。
English: DanceLCM enhances human image animation by employing segmented consistency distillation and motion-focused losses to achieve high-quality results with only 2-4 inference steps, drastically reducing computational costs while maintaining video fidelity.

Authors:Zongyue Qin, Shichang Zhang, Mingxuan Ju, Tong Zhao, Neil Shah, Yizhou Sun
Title: Heuristic Methods are Good Teachers to Distill MLPs for Graph Link Prediction
Abstract:
Link prediction is a crucial graph-learning task with applications including citation prediction and product recommendation. Distilling Graph Neural Network (GNN) teachers into Multi-Layer Perceptron (MLP) students has emerged as an effective approach to achieve strong performance and reduce computational cost by removing graph dependency. However, existing distillation methods only use standard GNNs and overlook alternative teachers such as specialized models for link prediction (GNN4LP) and heuristic methods (e.g., common neighbors). This paper first explores the impact of different teachers in GNN-to-MLP distillation. Surprisingly, we find that stronger teachers do not always produce stronger students: MLPs distilled from GNN4LP can underperform those distilled from simpler GNNs, while weaker heuristic methods can teach MLPs to near-GNN performance with drastically reduced training costs. Building on these insights, we propose Ensemble Heuristic-Distilled MLPs (EHDM), which eliminates graph dependencies while effectively integrating complementary signals via a gating mechanism. Experiments on ten datasets show an average 7.93% improvement over previous GNN-to-MLP approaches with 1.95-3.32 times less training time, indicating EHDM is an efficient and effective link prediction method.
Chinese: 本文提出集成启发式蒸馏多层感知机(EHDM),通过门控机制整合启发式方法的互补信号来训练高效MLP,在链接预测任务中相比传统GNN蒸馏方法实现了性能显著提升与训练时间大幅减少。
English: This paper introduces Ensemble Heuristic-Distilled MLPs (EHDM), a novel distillation method that leverages heuristic techniques to train efficient MLPs for link prediction, achieving significant performance gains and reduced training time compared to traditional GNN-to-MLP approaches.
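As a small illustration of the kind of heuristic teacher the abstract mentions, the sketch below computes the classic common-neighbors score for a candidate link. It is not the EHDM pipeline; the toy edge list and the helper names `build_adjacency` and `common_neighbors` are assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, Set[Hashable]]:
    """Build an undirected adjacency map from an edge list."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def common_neighbors(adj: Dict[Hashable, Set[Hashable]], u: Hashable, v: Hashable) -> int:
    """Common-neighbors heuristic: number of nodes adjacent to both u and v."""
    return len(adj.get(u, set()) & adj.get(v, set()))


# Toy graph (hypothetical data): a and b share the neighbors c and d.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("b", "e")]
adj = build_adjacency(edges)
score_ab = common_neighbors(adj, "a", "b")
score_ae = common_neighbors(adj, "a", "e")
```

A higher score suggests a more likely link; in a distillation setting such scores would serve as soft targets for the student, though how EHDM does this is not specified in the abstract.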

Authors:Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou
Title: GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
Abstract:
Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
中文: GenPRM提出了一种生成式过程奖励模型,通过结合代码验证的显式思维链推理来解决现有模型的局限性,在多项任务中显著超越先前方法,为大型语言模型的过程监督建立了新范式。
English: GenPRM introduces a generative process reward model that uses explicit Chain-of-Thought reasoning with code verification to overcome limitations in current process reward models, significantly outperforming prior methods and establishing a new paradigm for process supervision in large language models.

Authors:Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, Yang Gao
Title: Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
Abstract:
Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks.
Chinese: 提出的信息结构自适应(ISA)方法通过新颖的Fisher评分和渐进式训练,在推理过程中选择性地调整领域特定模型结构,使训练好的域内少样本分割模型能够适应新领域,无需昂贵地重新训练,并在多个基准测试中实现卓越性能。
English: The proposed Informative Structure Adaptation (ISA) method enables well-trained in-domain few-shot segmentation models to adapt to new domains during inference by selectively adjusting domain-specific model structures through a novel Fisher score and progressive training, eliminating the need for costly retraining while achieving superior performance across benchmarks.

Authors:Yanzhe Hu, Shenao Wang, Tianyuan Nie, Yanjie Zhao, Haoyu Wang
Title: Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities
Abstract:
Large Language Models (LLMs) have revolutionized artificial intelligence (AI), driving breakthroughs in natural language understanding, text generation, and autonomous systems. However, the rapid growth of LLMs presents significant challenges in the security and reliability of the Large Language Model Supply Chain (LLMSC), a complex network of open-source components, libraries, and tools essential for LLM development and deployment. Despite its critical importance, the LLMSC remains underexplored, particularly regarding its structural characteristics, domain composition, and security vulnerabilities. To address this gap, we conduct the first empirical study of the LLMSC, analyzing a curated dataset of open-source packages from PyPI and NPM across 14 functional domains. We construct a directed dependency graph comprising 15,725 nodes, 10,402 edges, and 180 unique vulnerabilities to investigate the structural characteristics of the LLMSC and analyze how security risks propagate through its dependency network. Our findings reveal that the LLMSC exhibits a "locally dense, globally sparse" topology, with 79.7% of dependency trees containing fewer than 5 nodes, while a few large trees dominate the ecosystem, accounting for 77.66% of all nodes. The graph is characterized by high-degree hubs, with the top 5 most connected nodes averaging 1,282 dependents each. Security analysis shows that critical vulnerabilities propagate to an average of 142.1 nodes at the second layer of dependency trees and peak at 237.8 affected nodes at the third layer. Notably, cascading risks are concentrated in critical hub nodes such as transformers, which directly or indirectly affect over 1,300 downstream packages. These findings provide quantitative insights into the structural and security dynamics of the LLMSC and emphasize the need for targeted mitigation strategies to enhance ecosystem resilience.
中文: 本研究首次对大语言模型供应链进行实证分析,揭示了其"局部密集、全局稀疏"的结构特征,发现关键安全漏洞集中在高连接度的枢纽节点,这些节点可将风险传播至数千个下游软件包。
English: This study presents the first empirical analysis of the Large Language Model Supply Chain (LLMSC), revealing its "locally dense, globally sparse" structure with critical security vulnerabilities concentrated in high-degree hub nodes that propagate risks to thousands of downstream packages.
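To make the notion of layer-wise risk propagation concrete, here is a minimal sketch that counts how many packages are newly affected at each dependency layer, using a breadth-first traversal over reverse dependencies. The paper's exact layer definition is not given in the abstract, so this is one plausible reading; the package names and the `affected_by_layer` helper are hypothetical.

```python
from typing import Dict, List


def affected_by_layer(dependents: Dict[str, List[str]], vulnerable: str) -> List[int]:
    """Count newly affected packages per layer, starting from a vulnerable package.

    `dependents` maps a package to the packages that depend on it directly.
    Layer 1 holds direct dependents, layer 2 their dependents, and so on.
    """
    seen = {vulnerable}
    frontier = [vulnerable]
    layers: List[int] = []
    while frontier:
        next_frontier: List[str] = []
        for pkg in frontier:
            for dep in dependents.get(pkg, []):
                if dep not in seen:  # count each package only once
                    seen.add(dep)
                    next_frontier.append(dep)
        if next_frontier:
            layers.append(len(next_frontier))
        frontier = next_frontier
    return layers


# Toy reverse-dependency graph (hypothetical packages).
dependents = {
    "core": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "d": ["e"],
}
layer_counts = affected_by_layer(dependents, "core")
```

Summing the per-layer counts gives the total number of downstream packages reachable from the vulnerable node, which is the quantity that makes hub packages so critical.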

Authors:Patrick Iff, Maciej Besta, Torsten Hoefler
Title: FoldedHexaTorus: An Inter-Chiplet Interconnect Topology for Chiplet-based Systems using Organic and Glass Substrates
Abstract:
Chiplet-based systems are rapidly gaining traction in the market. Two packaging options for such systems are the established organic substrates and the emerging glass substrates. These substrates are used to implement the inter-chiplet interconnect (ICI), which is crucial for overall system performance. To guide the development of ICIs, we introduce three design principles for ICI network topologies on organic and glass substrates. Based on our design principles, we propose the novel FoldedHexaTorus network topology. Our evaluation shows that the FoldedHexaTorus achieves significantly higher throughput than state-of-the-art topologies while maintaining low latency.
中文: 本研究针对有机和玻璃基板上的小芯片互连网络提出了三项设计原则,并推出了新颖的FoldedHexaTorus拓扑结构,经评估该结构在保持低延迟的同时,比现有先进拓扑实现了显著更高的吞吐量。
English: The study introduces three design principles for inter-chiplet interconnect networks on organic and glass substrates and proposes the novel FoldedHexaTorus topology, which demonstrates significantly higher throughput with low latency compared to existing designs.

Authors:Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, Julian McAuley
Title: In-context Ranking Preference Optimization
Abstract:
Recent developments in Direct Preference Optimization (DPO) allow large language models (LLMs) to function as implicit ranking models by maximizing the margin between preferred and non-preferred responses. In practice, user feedback on such lists typically involves identifying a few relevant items in context rather than providing detailed pairwise comparisons for every possible item pair. Moreover, many complex information retrieval tasks, such as conversational agents and summarization systems, critically depend on ranking the highest-quality outputs at the top, emphasizing the need to support natural and flexible forms of user feedback. To address the challenge of limited and sparse pairwise feedback in the in-context setting, we propose an In-context Ranking Preference Optimization (IRPO) framework that directly optimizes LLMs based on ranking lists constructed during inference. To further capture flexible forms of feedback, IRPO extends the DPO objective by incorporating both the relevance of items and their positions in the list. Modeling these aspects jointly is non-trivial, as ranking metrics are inherently discrete and non-differentiable, making direct optimization difficult. To overcome this, IRPO introduces a differentiable objective based on positional aggregation of pairwise item preferences, enabling effective gradient-based optimization of discrete ranking metrics. We further provide theoretical insights showing that IRPO (i) automatically emphasizes items with greater disagreement between the model and the reference ranking, and (ii) links its gradient to an importance sampling estimator, yielding an unbiased estimator with reduced variance. Empirical results show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
中文:IRPO框架通过引入上下文排名列表和位置相关性改进了直接偏好优化,利用可微分目标有效优化离散排名指标,并在排名性能上展现出优于标准DPO方法的效果。
English: The IRPO framework enhances Direct Preference Optimization by incorporating in-context ranking lists and positional relevance, enabling effective optimization of discrete ranking metrics through a differentiable objective and demonstrating superior performance over standard DPO methods.
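For readers unfamiliar with the margin that DPO maximizes, the following sketch evaluates the standard pairwise DPO loss for a single preferred/non-preferred response pair. It shows only the baseline objective that the abstract builds on, not IRPO's positional extension; the function name, argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is the policy-vs-reference log-ratio of the preferred response
    minus the same log-ratio of the non-preferred response.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero and the loss is log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy prefers the chosen response more than the reference does: lower loss.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference, which is the pairwise signal that a list-level objective would need to aggregate across positions.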

Authors:Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
Title: CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Abstract:
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
Chinese: CLIMB框架通过语义聚类和迭代评估自动发现并优化预训练数据组合,在1B模型上实现了比现有最优模型高2.0%的性能提升,并在特定领域(如社会科学)取得了优于随机采样5%的效果。
English: The CLIMB framework automates the discovery and refinement of optimal pre-training data mixtures through semantic clustering and iterative evaluation, achieving a 2.0% performance gain over state-of-the-art models and enabling domain-specific improvements of up to 5%.

Authors:Joanne Lin, Crispian Morris, Ruirui Lin, Fan Zhang, David Bull, Nantheera Anantrasirichai
Title: Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline
Abstract:
Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high-quality datasets to synthetically created low-light versions. In addition, these approaches are often limited by the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24% KLD, 21% LPIPS, and 62% AP$_{50-95}$, respectively.
中文: 本文提出了一种退化估计网络(DEN),无需相机元数据即可生成逼真的sRGB噪声,通过自监督训练在低光任务(如视频增强和目标检测)中实现了性能提升。
English: The paper introduces a Degradation Estimation Network (DEN) that generates realistic sRGB noise without camera metadata, enabling improved performance in low-light tasks like video enhancement and object detection through self-supervised training.

Authors:Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, Haibo Lei, Qifang Gao, Yaqing Li, Weihua Luo, Tsing Li, Qing Wang, Yi Liu, Yang Wang, Hongyu An, Liou Zhang, Shijie Zhao, Lianhong Song, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Jing Wei, Mengyang Wang, Ruilong Guo, Qian Wang, Qingliang Liu, Yang Cheng, Davinci, Enxuan Gu, Pinxin Liu, Yongsheng Yu, Hang Hua, Yunlong Tang, Shihao Wang, Yukun Yang, Zhiyu Zhang, Yukun Yang, Jiyu Wu, Jiancheng Huang, Yifan Liu, Yi Huang, Shifeng Chen, Rui Chen, Yi Feng, Mingxi Li, Cailu Wan, Xiangji Wu, Zibin Liu, Jinyang Zhong, Kihwan Yoon, Ganzorig Gankhuyag, Shengyun Zhong, Mingyang Wu, Renjie Li, Yushen Zuo, Zhengzhong Tu, Zongang Gao, Guannan Chen, Yuan Tian, Wenhui Chen, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Yilin Zhang, Huan Zheng, Yanyan Wei, Wenxuan Zhao, Suiyi Zhao, Fei Wang, Kun Li, Yinggan Tang, Mengjie Su, Jae-hyeon Lee, Dong-Hyeop Son, Ui-Jin Choi, Tiancheng Shao, Yuqing Zhang, Mengcheng Ma, Donggeun Ko, Youngsang Kwak, Jiun Lee, Jaehwa Kwak, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jing Hu, Hui Deng, Xuan Zhang, Lin Zhu, Qinrui Fan, Weijian Deng, Junnan Wu, Wenqin Deng, Yuquan Liu, Zhaohong Xu, Jameer Babu Pinjari, Kuldeep Purohit, Zeyu Xiao, Zhuoyuan Li, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Sachin Chaudhary, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Wei-Chen Shen, I-Hsiang Chen, Yunzhe Xu, Chen Zhao, Zhizhou Chen, Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh, Alejandro Merino, Bruno Longarela, Javier Abad, Marcos V. Conde, Simone Bianco, Luca Cogo, Gianmarco Corti
Title: The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge attracted 244 registered entrants, with 43 teams submitting valid entries. This report analyzes these methods and results in detail, emphasizing advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
中文: NTIRE 2025高效超分辨率挑战赛成功推动了在保持高图像质量的同时优化计算效率的深度学习模型发展,吸引了244名参与者,并为未来研究确立了创新基准。
English: The NTIRE 2025 Challenge on Efficient Super-Resolution successfully advanced deep learning models that balance computational efficiency with high image quality, drawing 244 participants and yielding innovative benchmarks for future research.
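The PSNR thresholds quoted in the abstract follow the usual definition, PSNR = 10 * log10(MAX^2 / MSE). The sketch below computes it for flat lists of pixel values; it assumes 8-bit intensities (MAX = 255) and does not reproduce the challenge's actual evaluation protocol (colour space, border cropping), which the abstract does not state.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example (hypothetical values): squared errors 4, 0, 1, 4 give MSE = 2.25.
reference = [52, 55, 61, 59]
restored = [54, 55, 60, 57]
value_db = psnr(reference, restored)
```

Because PSNR is logarithmic in the mean squared error, small dB differences near a threshold such as 26.90 dB correspond to small relative changes in MSE.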

Authors:Yanlin Wang, Kefeng Duan, Dewu Zheng, Ensheng Shi, Fengji Zhang, Yanli Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Hongyu Zhang, Qianxiang Wang, Zibin Zheng
Title: Towards an Understanding of Context Utilization in Code Intelligence
Abstract:
Code intelligence is an emerging domain in software engineering, aiming to improve the effectiveness and efficiency of various code-related tasks. Recent research suggests that incorporating contextual information beyond the basic original task inputs (i.e., source code) can substantially enhance model performance. Such contextual signals, obtained directly or indirectly from sources such as API documentation or intermediate representations like abstract syntax trees, can significantly improve the effectiveness of code intelligence. Despite growing academic interest, there is a lack of systematic analysis of context in code intelligence. To address this gap, we conduct an extensive literature review of 146 relevant studies published between September 2007 and August 2024. Our investigation yields four main contributions: (1) a quantitative analysis of the research landscape, including publication trends, venues, and the explored domains; (2) a novel taxonomy of context types used in code intelligence; (3) a task-oriented analysis investigating context integration strategies across diverse code intelligence tasks; (4) a critical assessment of evaluation methodologies for context-aware methods. Based on these findings, we identify fundamental challenges in context utilization in current code intelligence systems and propose a research roadmap that outlines key opportunities for future research.
中文: 本文系统综述了146项代码智能研究,分析了上下文信息如何提升模型性能,并提出了分类体系、集成策略及未来研究方向。
English: This paper systematically reviews 146 studies on code intelligence, analyzing how contextual information enhances model performance and proposing a taxonomy, integration strategies, and future research directions.

Authors:Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng
Title: FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
Abstract:
Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their ability to comprehend and effectively leverage diverse types of feedback remains insufficiently understood. To bridge this gap, we introduce FeedbackEval, a systematic benchmark for evaluating LLMs' feedback comprehension and performance in code repair tasks. We conduct a comprehensive empirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5, Gemini-1.5, GLM-4, and Qwen2.5, to evaluate their behavior under both single-iteration and iterative code repair settings. Our results show that structured feedback, particularly in the form of test feedback, leads to the highest repair success rates, while unstructured feedback proves significantly less effective. Iterative feedback further enhances repair performance, though the marginal benefit diminishes after two or three rounds. Moreover, prompt structure is shown to be critical: incorporating docstrings, contextual information, and explicit guidelines substantially improves outcomes, whereas persona-based, chain-of-thought, and few-shot prompting strategies offer limited benefits in single-iteration scenarios. This work introduces a robust benchmark and delivers practical insights to advance the understanding and development of feedback-driven code repair using LLMs.
中文: 本研究提出FeedbackEval基准,发现结构化测试反馈能显著提升大语言模型的代码修复成功率,迭代反馈和清晰提示可进一步增强效果,而非结构化反馈及特定提示策略效果有限。
English: This study introduces FeedbackEval, a benchmark revealing that structured test feedback significantly boosts LLMs' code repair success, with iterative feedback and clear prompts further enhancing performance, while unstructured feedback and certain prompting strategies show limited effectiveness.

Authors:Yuyang Zhang, Baao Xie, Hu Zhu, Qi Wang, Huanting Guo, Xin Jin, Wenjun Zeng
Title: Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning
Abstract:
Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods struggle to capture underlying 3D semantics, which hinders model controllability and interpretability. To address this, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model employs a dual-branch architecture, consisting of a point cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry and visual appearance features. Subsequently, fine-grained semantic representations within each modality are further discovered through DRL-based encoder-adapters. To our knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate that our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.
中文: 提出的3DisGS框架通过双分支架构和表示学习实现层次化三维语义解耦,在保持高质量重建的同时,首次实现了无需监督的可解释三维高斯泼溅。
English: The proposed 3DisGS framework introduces an interpretable single-view Gaussian Splatting approach that achieves hierarchical disentanglement of 3D semantics through dual-branch architecture and representation learning, enabling unsupervised semantic understanding while maintaining high-quality reconstruction.

Authors:Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Title: Affordable AI Assistants with Knowledge Graph of Thoughts
Abstract:
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Chinese: 提出的思维知识图谱(KGoT)架构通过将大语言模型推理与动态知识图谱相结合,显著提升了AI助手的性能,在GAIA基准上的任务成功率提高了29%,且运行成本相比GPT-4o降低了超过36倍。
English: The proposed Knowledge Graph of Thoughts (KGoT) architecture enhances AI assistant performance by integrating LLM reasoning with dynamic knowledge graphs, achieving a 29% higher success rate on GAIA and reducing operational costs by over 36 times compared to GPT-4o.

Authors:Haozhe Yin, Kai Wang, Wenjie Zhang, Ying Zhang, Ruijia Wu, Xuemin Lin
Title: Efficient Computation of Hyper-triangles on Hypergraphs
Abstract:
Hypergraphs, which use hyperedges to capture groupwise interactions among different entities, have gained increasing attention recently for their versatility in effectively modeling real-world networks. In this paper, we study the problem of computing hyper-triangles (formed by three fully-connected hyperedges), which is a basic structural unit in hypergraphs. Although existing approaches can be adopted to compute hyper-triangles by exhaustively examining hyperedge combinations, they overlook the structural characteristics distinguishing different hyper-triangle patterns. Consequently, these approaches lack specificity in computing particular hyper-triangle patterns and exhibit low efficiency. In this paper, we unveil a new formation pathway for hyper-triangles, transitioning from hyperedges to hyperwedges before assembling into hyper-triangles, and classify hyper-triangle patterns based on hyperwedges. Leveraging this insight, we introduce a two-step framework to reduce the redundant checking of hyperedge combinations. Under this framework, we propose efficient algorithms for computing a specific pattern of hyper-triangles. Approximate algorithms are also devised to support estimated counting scenarios. Furthermore, we introduce a fine-grained hypergraph clustering coefficient measurement that can reflect diverse properties of hypergraphs based on different hyper-triangle patterns. Extensive experimental evaluations conducted on 11 real-world datasets validate the effectiveness and efficiency of our proposed techniques.
中文: 本文提出了一种新颖的两步框架,通过先分析超楔形结构来高效计算超图中的特定超三角形模式,并开发了近似计数方法和细粒度聚类系数测量,在多个真实数据集上验证了其有效性。
English: This paper introduces a novel two-step framework for efficiently computing specific hyper-triangle patterns in hypergraphs by first analyzing hyperwedges, along with approximate counting methods and a refined clustering coefficient measurement, all validated across multiple real-world datasets.
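
The edge-to-wedge-to-triangle pathway described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm. It is a naive baseline that assumes "fully-connected" means the three hyperedges pairwise share at least one vertex, and it does not distinguish between hyper-triangle patterns. The function names are hypothetical.

```python
from itertools import combinations


def hyperwedges(hyperedges):
    """Return index pairs (i, j), i < j, of hyperedges that share a vertex."""
    return [
        (i, j)
        for i, j in combinations(range(len(hyperedges)), 2)
        if hyperedges[i] & hyperedges[j]
    ]


def count_hyper_triangles(hyperedges):
    """Count unordered triples of pairwise-intersecting hyperedges.

    Step 1 builds hyperwedges (intersecting pairs); step 2 closes each
    wedge with a third hyperedge adjacent to both ends.
    """
    wedges = hyperwedges(hyperedges)
    neighbors = {i: set() for i in range(len(hyperedges))}
    for i, j in wedges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    count = 0
    for i, j in wedges:
        # Only accept k > j so that each triple i < j < k is counted once.
        count += sum(1 for k in neighbors[i] & neighbors[j] if k > j)
    return count


if __name__ == "__main__":
    edges = [{1, 2, 3}, {3, 4}, {4, 1}, {5, 6}]
    print(count_hyper_triangles(edges))
```

Starting from wedges rather than from all hyperedge triples means that triples containing a non-intersecting pair are never examined, which is the kind of redundant checking the two-step framework aims to reduce.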

Authors:Zhaolin Wang, Chongjun Ouyang, Yuanwei Liu
Title: Beamforming Design for Continuous Aperture Array (CAPA)-Based MIMO Systems
Abstract:
An efficient beamforming design is proposed for continuous aperture array (CAPA)-based point-to-point multiple-input multiple-output (MIMO) systems. In contrast to conventional spatially discrete array (SPDA)-MIMO systems, whose optimal beamforming can be obtained using singular-value decomposition, CAPA-MIMO systems require solving the eigendecomposition of a Hermitian kernel operator, which is computationally prohibitive. To address this challenge, an explicit closed-form expression for the achievable rate of CAPA-MIMO systems is first derived as a function of the continuous transmit beamformer. Subsequently, an iterative weighted minimum mean-squared error (WMMSE) algorithm is proposed, directly addressing the CAPA-MIMO beamforming optimization without discretization approximation. Closed-form updates for each iteration of the WMMSE algorithm are derived via the calculus of variations (CoV) method. For low-complexity implementation, an equivalent matrix-based iterative solution is introduced using Gauss-Legendre quadrature. Our numerical results demonstrate that 1) CAPA-MIMO achieves substantial performance gain over the SPDA-MIMO, 2) the proposed WMMSE algorithm enhances performance while significantly reducing computational complexity compared to state-of-the-art Fourier-based approaches, and 3) the proposed WMMSE algorithm enables practical realization of parallel, non-interfering transmissions.
中文: 本研究针对CAPA-MIMO系统提出了一种高效的波束成形设计,采用迭代WMMSE算法,在提升性能的同时显著降低了计算复杂度。
English: This study introduces an efficient beamforming design for CAPA-MIMO systems using an iterative WMMSE algorithm that enhances performance while significantly reducing computational complexity compared to state-of-the-art Fourier-based approaches.
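
The abstract mentions Gauss-Legendre quadrature as the tool that turns integrals over the continuous aperture into finite weighted sums. The sketch below shows only that generic building block, a three-point rule on an interval [a, b], and not the paper's matrix-based WMMSE updates. The function name and the choice of three points are assumptions made for illustration.

```python
import math

# Three-point Gauss-Legendre rule on [-1, 1]: nodes and matching weights.
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre_3(f, a, b):
    """Approximate the integral of f over [a, b] with three sample points."""
    half = 0.5 * (b - a)  # scale factor from [-1, 1] to [a, b]
    mid = 0.5 * (a + b)   # centre of the target interval
    return half * sum(w * f(mid + half * x) for x, w in zip(_NODES, _WEIGHTS))


if __name__ == "__main__":
    # Exact for polynomials up to degree 5: the integral of x^4 on [-1, 1] is 2/5.
    print(gauss_legendre_3(lambda x: x ** 4, -1.0, 1.0))
```

Because an n-point rule is exact for polynomials up to degree 2n - 1, a modest number of sample points can stand in for a smooth continuous integrand, which is why such a rule suits a low-complexity implementation.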

Authors:Shiqian Zhao, Jiayang Liu, Yiming Li, Runyi Hu, Xiaojun Jia, Wenshu Fan, Xinfeng Li, Jie Zhang, Wei Dong, Tianwei Zhang, Luu Anh Tuan
Title: When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems
Abstract:
Modern text-to-image (T2I) generation systems (e.g., DALL·E 3) exploit a memory mechanism, which captures key information in multi-turn interactions for faithful generation. Despite its practicality, security analysis of this mechanism has lagged far behind. In this paper, we reveal that it can exacerbate the risk of jailbreak attacks. Previous attacks fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or can lead to the generation of images that are no longer unsafe due to under- or over-detoxification. In contrast, we propose embedding the malice at the inception of the chat session in memory, addressing the above limitations. Specifically, we propose Inception, the first multi-turn jailbreak attack against real-world text-to-image generation systems that explicitly exploits their memory mechanisms. Inception is composed of two key modules: segmentation and recursion. We introduce Segmentation, a semantic-preserving method that generates multi-round prompts. By leveraging NLP analysis techniques, we design policies to decompose a prompt, together with its malicious intent, according to sentence structure, thereby evading safety filters. Recursion further addresses the challenge posed by unsafe sub-prompts that cannot be separated through simple segmentation. It first expands the sub-prompt, then invokes segmentation recursively. To facilitate the crafting of multi-turn adversarial prompts, we build VisionFlow, an emulation T2I system that integrates two-stage safety filters and industrial-grade memory mechanisms. Experimental results show that Inception successfully induces unsafe image generation, surpassing the SOTA by a 20.0% margin in attack success rate. We also conduct experiments on real-world commercial T2I generation platforms, further validating the threats of Inception in practice.
中文: 现代文生图系统的记忆机制虽提升了生成准确性,却加剧了越狱攻击风险,为此提出的Inception方法通过分段和递归处理提示,有效规避安全过滤,在攻击成功率上显著超越现有技术。
English: Modern text-to-image systems' memory mechanisms, while enhancing generation fidelity, are shown to heighten jailbreak risks, leading to the development of Inception—a multi-turn attack method that segments and recursively processes prompts to bypass safety filters effectively.

Authors:Qinyu Chen, Chang Gao, Min Liu, Daniele Perrone, Yan Ru Pei, Zuowen Wang, Zhuo Zou, Shihang Tan, Tao Han, Guorui Lu, Zhen Xu, Junyuan Ding, Ziteng Wang, Zongwei Wu, Han Han, Yuliang Wu, Jinze Chen, Wei Zhai, Yang Cao, Zheng-jun Zha, Nuwan Bandara, Thivya Kandappu, Archan Misra, Xiaopeng Lin, Hongxiang Huang, Hongwei Ren, Bojun Cheng, Hoang M. Truong, Vinh-Thuan Ly, Huy G. Tran, Thuan-Phat Nguyen, Tram T. Doan
Title: Event-Based Eye Tracking. 2025 Event-based Vision Workshop
Abstract:
This survey reviews the 2025 Event-Based Eye Tracking Challenge, organized as part of the 2025 CVPR Event-based Vision Workshop. The challenge focuses on predicting the pupil center from eye movements recorded by an event camera. We review and summarize the innovative methods of the top-ranked teams in order to advance future event-based eye tracking research. For each method, accuracy, model size, and number of operations are reported. We also discuss event-based eye tracking from the perspective of hardware design.
Chinese: 本综述回顾了2025年基于事件的眼动追踪挑战赛,重点介绍了顶尖团队在瞳孔中心预测方面的创新方法,并分析了其精度、模型复杂度及硬件设计考量。
English: This survey reviews the 2025 Event-Based Eye Tracking Challenge, highlighting top teams' methods for pupil center prediction and analyzing their accuracy, model complexity, and hardware design implications.

Authors:Yating Wang, Xuan Wang, Ran Yi, Yanbo Fan, Jichen Hu, Jingcheng Zhu, Lizhuang Ma
Title: 3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
Abstract:
Recent studies have combined 3D Gaussians and 3D Morphable Models (3DMM) to construct high-quality 3D head avatars. In this line of research, existing methods either fail to capture dynamic textures or incur significant overhead in runtime speed or storage space. To this end, we propose a novel method that addresses all of these demands. Specifically, we introduce an expressive and compact representation that encodes texture-related attributes of the 3D Gaussians in a tensorial format. We store the appearance of the neutral expression in static tri-planes and represent dynamic texture details for different expressions using lightweight 1D feature lines, which are then decoded into opacity offsets relative to the neutral face. We further propose an adaptive truncated opacity penalty and class-balanced sampling to improve generalization across different expressions. Experiments show that this design captures dynamic facial details accurately while maintaining real-time rendering and significantly reducing storage costs, thus broadening its applicability to more scenarios.
中文: 本研究提出了一种新方法,通过静态三平面编码中性表情,结合轻量级一维特征线处理动态纹理,实现了实时高质量3D头部虚拟形象渲染,同时显著降低了存储成本。
English: This study introduces a novel method that combines static tri-planes for neutral expressions with lightweight 1D feature lines for dynamic textures, enabling real-time, high-quality 3D head avatar rendering with reduced storage costs.

Authors:Yonghui Li, Chentao Yue, Branka Vucetic
Title: Optimal Linear MAP Decoding of Convolutional Codes
Abstract:
In this paper, we propose a linear representation of BCJR maximum a posteriori probability (MAP) decoding of a rate 1/2 convolutional code (CC), referred to as linear MAP (LMAP) decoding. We discover that the MAP forward and backward decoding can be implemented by the corresponding dual soft input and soft output (SISO) encoders using shift registers. The bidirectional MAP decoding output can be obtained by combining the contents of the respective forward and backward dual encoders. Because it is represented using simple shift registers, the LMAP decoder maps naturally to hardware registers and can thus be easily implemented. Simulation results demonstrate that LMAP decoding achieves the same performance as BCJR MAP decoding, but with a significantly reduced decoding delay. For a block length of 64, a CC with memory length 14 and LMAP decoding surpasses the random coding union (RCU) bound by approximately 0.5 dB at a BLER of $10^{-3}$, and closely approaches both the normal approximation (NA) and meta-converse (MC) bounds.
中文: 本文提出了一种针对1/2码率卷积码的线性最大后验概率(LMAP)译码方法,通过采用带移位寄存器的双软输入软输出编码器,在保持BCJR MAP性能的同时显著降低译码延迟,并在64位块长下接近理论界限。
English: This paper introduces a linear MAP (LMAP) decoding method for rate 1/2 convolutional codes that uses dual SISO encoders with shift registers to achieve BCJR MAP performance with significantly reduced delay, closely approaching theoretical bounds at a block length of 64.

Authors:Muhammad Haseeb Aslam, Clara Martinez, Marco Pedersoli, Alessandro Koerich, Ali Etemad, Eric Granger
Title: Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation
Abstract:
Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) architecture, the student performance can surpass the teacher, particularly when the network is overparameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble to a single student model or weight averaging methods first requires training of multiple teacher models and does not fully leverage the inherent stochasticity for generating and distilling diversity in DL models. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only, using student-guided knowledge distillation (SGKD). The student representation at each distillation step is used as an authority to guide the distillation process. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at both training and testing time, and incurs negligible computational complexity compared to state-of-the-art ensemble learning and weight averaging methods.
中文: 本文提出了一种新颖的随机自蒸馏方法,通过蒸馏时dropout生成多样化的教师表征,并利用学生引导的知识蒸馏进行筛选和加权,在不增加模型复杂度的前提下超越了现有最优方法。
English: The paper introduces a novel stochastic self-distillation (SSD) method that trains a single model using distillation-time dropout to generate diverse teacher representations, which are then filtered and weighted through student-guided knowledge distillation to outperform state-of-the-art methods without increasing model complexity.

Authors:Runyi Hu, Jie Zhang, Shiqian Zhao, Nils Lukas, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Title: Mask Image Watermarking
Abstract:
We present MaskMark, a simple, efficient, and flexible framework for image watermarking. MaskMark has two variants: (1) MaskMark-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection; (2) MaskMark-ED, which focuses on local watermark embedding and extraction, offering enhanced robustness in small regions to support fine-grained image protection. MaskMark-D builds on the classical encoder-distortion layer-decoder training paradigm. In MaskMark-D, we introduce a simple masking mechanism during the decoding stage that enables both global and local watermark extraction. During training, the decoder is guided by various types of masks applied to watermarked images before extraction, helping it learn to localize watermarks and extract them from the corresponding local areas. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions, which improves robustness under regional attacks. Extensive experiments show that MaskMark achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. In addition, MaskMark is highly efficient and adaptable. It requires only 20 hours of training on a single A6000 GPU, achieving 15x computational efficiency compared to WAM. By simply adjusting the distortion layer, MaskMark can be quickly fine-tuned to meet varying robustness requirements.
中文:MaskMark是一种简单高效的图像水印框架,包含支持全局和局部水印处理的MaskMark-D与专注局部鲁棒性的MaskMark-ED两种变体,在各项任务中均实现最优性能,同时保持高视觉质量和计算效率。
English: MaskMark is a versatile image watermarking framework with two variants, MaskMark-D for global and local watermark tasks and MaskMark-ED for enhanced local robustness, that achieves state-of-the-art performance and high efficiency while preserving high visual quality.

Authors:Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, Xiangyu Kong, Hyunhee Park, Xiaoxuan Yu, Suejin Han, Hakjae Jeon, Jia Li, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Jingyu Ma, Zhijuan Huang, Huiyuan Fu, Hongyuan Yu, Boqi Zhang, Jiawei Shi, Heng Zhang, Huadong Ma, Deepak Kumar Tyagi, Aman Kukretti, Gajender Sharma, Sriharsha Koundinya, Asim Manna, Jun Cheng, Shan Tan, Jun Liu, Jiangwei Hao, Jianping Luo, Jie Lu, Satya Narayan Tazi, Arnim Gautam, Aditi Pawar, Aishwarya Joshi, Akshay Dudhane, Praful Hambadre, Sachin Chaudhary, Santosh Kumar Vipparthi, Subrahmanyam Murala, Jiachen Tu, Nikhil Akalwadi, Vijayalaxmi Ashok Aralikatti, Dheeraj Damodar Hegde, G Gyaneshwar Rao, Jatin Kalal, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Zhenyuan Lin, Yubo Dong, Weikun Li, Anqi Li, Ang Gao, Weijun Yuan, Zhan Li, Ruting Deng, Yihang Chen, Yifan Deng, Zhanglu Chen, Boyang Yao, Shuling Zheng, Feng Zhang, Zhiheng Fu, Anas M. Ali, Bilel Benjdira, Wadii Boulila, Jan Seny, Pei Zhou, Jianhua Hu, K. L. Eddie Law, Jaeho Lee, M. J. Aashik Rasool, Abdur Rehman, SMA Sharif, Seongwan Kim, Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Cosmin Ancuti, Zeyu Xiao, Zhuoyuan Li, Ziqi Wang, Yanyan Wei, Fei Wang, Kun Li, Shengeng Tang, Yunkai Zhang, Weirun Zhou, Haoxuan Lu
Title: The Tenth NTIRE 2025 Image Denoising Challenge Report
Abstract:
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
中文: 本文概述了NTIRE 2025图像去噪挑战赛,重点介绍了在固定高斯噪声条件下实现高PSNR值的先进网络架构,共有20支团队提交了代表当前最高水平的结果。
English: This paper summarizes the NTIRE 2025 Image Denoising Challenge, focusing on advanced network architectures that achieved high PSNR scores for removing fixed Gaussian noise, with 20 teams contributing state-of-the-art results.
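
The evaluation setup described in the abstract (AWGN at a fixed noise level of 50, scored by PSNR) can be sketched as follows. This is a minimal illustration and not the challenge's official scoring code. It assumes 8-bit intensities with a peak value of 255, does not clip or quantize the noisy values, and uses the hypothetical helper names `add_awgn` and `psnr`.

```python
import math
import random


def add_awgn(pixels, sigma=50.0, seed=0):
    """Add independent zero-mean Gaussian noise with std `sigma` to each pixel."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    clean = [128.0] * 10_000   # flat grey test signal (assumed data)
    noisy = add_awgn(clean)    # sigma = 50, as in the challenge description
    print(f"PSNR of the noisy input: {psnr(clean, noisy):.2f} dB")
```

Under these assumptions the unprocessed noisy input has an expected MSE of about 50^2 = 2500, which corresponds to roughly 10·log10(255^2 / 2500) ≈ 14.15 dB. Any denoiser is effectively measured by how far it lifts PSNR above that level.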

Authors:Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie
Title: SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Abstract:
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.
中文: 本研究发现监督微调会通过引入僵化的模仿推理阻碍大型视觉语言模型的强化学习,但采用混合奖励的新型强化学习方法能促进自适应推理并实现顶尖性能。
English: This study finds that supervised fine-tuning can hinder reinforcement learning in large vision-language models by introducing rigid imitative reasoning, but a novel RL approach with mixed rewards fosters adaptive reasoning and achieves state-of-the-art performance.

Authors:Zeming Wei, Junyi Lin, Yang Liu, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
Title: 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians
Abstract:
3D affordance reasoning is essential in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,677 Gaussian instances, 8,354 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities.
中文: 3DAffordSplat推出了首个基于3D高斯泼溅的大规模数据集和AffordSplatNet模型,通过跨模态对齐和广泛实验验证,解决了先前方法在泛化性和准确性上的局限,显著提升了三维功能推理性能。
English: 3DAffordSplat introduces the first large-scale dataset and AffordSplatNet model for 3D affordance reasoning using 3D Gaussian Splatting, overcoming previous limitations in generalization and accuracy through cross-modal alignment and extensive experimental validation.

Authors:Yuqian Fu, Xingyu Qiu, Bin Ren, Yanwei Fu, Radu Timofte, Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, Kaijin Zhang, Qingpeng Nong, Xiugang Dong, Hong Gao, Xiangsheng Zhou, Jiancheng Pan, Yanxing Liu, Xiao He, Jiahao Li, Yuze Sun, Xiaomeng Huang, Zhenyu Zhang, Ran Ma, Yuhan Liu, Zijian Zhuang, Shuai Yi, Yixiong Zou, Lingyi Hong, Mingxi Chen, Runze Li, Xingdong Sheng, Wenqiang Zhang, Weisen Chen, Yongxin Yan, Xinguo Chen, Yuanjie Shao, Zhengrong Zuo, Nong Sang, Hao Wu, Haoran Sun, Shuming Hu, Yan Zhang, Zhiguang Shi, Yu Zhang, Chao Chen, Tao Wang, Da Feng, Linhai Zhuo, Ziming Lin, Yali Huang, Jie Me, Yiming Yang, Mi Guo, Mingyuan Jiu, Mingliang Xu, Maomao Xiong, Qunshu Zhang, Xinyu Cao, Yuqing Yang, Dianmo Sheng, Xuanpu Zhao, Zhiyu Li, Xuyang Ding, Wenqian Li
Title: NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Abstract:
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.
中文: 首届NTIRE 2025跨域少样本目标检测挑战赛成功推动了该领域发展,吸引了众多参与者开发创新模型,在有限标注数据的新领域实现了最先进的检测性能。
English: The 1st NTIRE 2025 CD-FSOD Challenge successfully advanced cross-domain few-shot object detection by attracting numerous participants who developed innovative models achieving state-of-the-art results with limited labeled data in novel domains.

Authors:Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Xuanjing Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei
Title: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
Abstract:
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
中文: SocioVerse作为一种基于大语言模型的社会模拟世界模型,通过四个强大的对齐组件和千万级真实用户池,在政治、新闻和经济领域的大规模实验中展现出能有效反映群体动态,并确保多样性、可信度和代表性。
English: SocioVerse is an advanced LLM-driven world model that enhances social simulation by incorporating four alignment components and a vast user pool, effectively capturing large-scale population dynamics with diversity and credibility across political, news, and economic domains.

Authors:Shuai Zhao, Linchao Zhu, Yi Yang
Title: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Abstract:
Large language models (LLMs) are expected to be helpful, harmless, and honest. In alignment scenarios such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but essential for transferring human preference. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function choice for LLM alignment. Similarity reward circumvents binary preference data collection and reward modeling when unary high-quality reference answers are available. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reference or reward models. RefAlign utilizes similarity metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, RefAlign demonstrates comparable performance to previous alignment methods without binary preference data and reward models.
中文: RefAlign是一种基于强化学习的对齐方法,它通过计算生成回答与高质量参考答案之间的相似度(如BERTScore)作为奖励信号,无需二元偏好数据和奖励模型即可实现与现有方法相当的性能。
English: RefAlign is a reinforcement learning-based alignment method that uses similarity metrics like BERTScore between generated responses and high-quality references as rewards, eliminating the need for binary preference data and reward models while achieving performance comparable to prior approaches.
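
The idea of using similarity to a reference answer as a surrogate reward in a REINFORCE-style update can be sketched as below. This is a toy illustration under stated assumptions, not RefAlign itself: token-overlap F1 stands in for BERTScore, the mean-reward baseline is an assumed variance-reduction choice, and all function names are hypothetical.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1 between two strings (a crude stand-in for BERTScore)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def advantages(samples, reference):
    """Similarity rewards minus their mean (an assumed baseline)."""
    rewards = [token_f1(s, reference) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def surrogate_loss(log_probs, advs):
    """REINFORCE surrogate: minimising it raises log-probs of above-baseline samples."""
    return -sum(a * lp for a, lp in zip(advs, log_probs)) / len(log_probs)


if __name__ == "__main__":
    reference = "the cat sat"
    samples = ["the cat sat", "dogs bark"]
    advs = advantages(samples, reference)
    # The sequence log-probabilities would come from the policy model; these are made up.
    print(advs, surrogate_loss([-1.0, -2.0], advs))
```

No preference pairs or learned reward model appear anywhere in this loop: the only supervision is a single reference answer per prompt, which is the property the abstract emphasizes.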

Authors:Rong Yao, Hailin Hu, Yifei Fu, Hanting Chen, Wenyi Fang, Fanyi Du, Kai Han, Yunhe Wang
Title: Transferable text data distillation by trajectory matching
Abstract:
In the realm of large language models (LLMs), growing model sizes bring higher training costs, and there is an urgent need to minimize the data size in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full data set, and it offers better flexibility. Despite the success of data distillation in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that learns pseudo prompt data based on trajectory matching and finds its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, the ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM architectures (i.e., OPT to Llama).
中文: 本研究提出了一种适用于文本生成任务的数据蒸馏方法,通过伪提示学习和轨迹匹配合成精简数据集,在性能上超越了现有最优数据选择方法并展现出良好的跨架构迁移能力。
English: This study introduces a novel data distillation method for text generation tasks that synthesizes compact datasets through pseudo prompt learning and trajectory matching, demonstrating superior performance over state-of-the-art selection methods and cross-architecture transferability.

Authors:Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Title: Don't Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs
Abstract:
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their vulnerability to user gaslighting (the deliberate use of misleading or contradictory inputs) raises critical concerns about their reliability in real-world applications. In this paper, we address the novel and challenging issue of mitigating the negative impact of negation-based gaslighting on LMMs, where deceptive user statements lead to significant drops in model accuracy. Specifically, we introduce GasEraser, a training-free approach that reallocates attention weights from misleading textual tokens to semantically salient visual regions. By suppressing the influence of "attention sink" tokens and enhancing focus on visually grounded cues, GasEraser significantly improves LMM robustness without requiring retraining or additional supervision. Extensive experimental results demonstrate that GasEraser is effective across several leading open-source LMMs on the GaslightingBench. Notably, for LLaVA-v1.5-7B, GasEraser reduces the misguidance rate by 48.2%, demonstrating its potential for more trustworthy LMMs.
中文: 大型多模态模型易受用户误导,但GasEraser方法通过将注意力从误导性文本重新分配到视觉线索,无需重新训练即可显著增强模型的鲁棒性。
English: Large Multimodal Models are susceptible to user gaslighting, but the proposed GasEraser method effectively mitigates this by reallocating attention from misleading text to visual cues, significantly enhancing model robustness without retraining.

Authors:Junhao Xu, Jingjing Chen, Yang Jiao, Jiacheng Zhang, Zhiyu Tan, Hao Li, Yu-Gang Jiang
Title: Identity-Aware Vision-Language Model for Explainable Face Forgery Detection
Abstract:
Recent advances in generative artificial intelligence have enabled the creation of highly realistic image forgeries, raising significant concerns about digital media authenticity. While existing detection methods demonstrate promising results on benchmark datasets, they face critical limitations in real-world applications. First, existing detectors typically fail to detect semantic inconsistencies with the person's identity, such as implausible behaviors or incompatible environmental contexts in given images. Second, these methods rely heavily on low-level visual cues, making them effective for known forgeries but less reliable against new or unseen manipulation techniques. To address these challenges, we present a novel personalized vision-language model (VLM) that integrates low-level visual artifact analysis and high-level semantic inconsistency detection. Unlike previous VLM-based methods, our approach avoids resource-intensive supervised fine-tuning that often struggles to preserve distinct identity characteristics. Instead, we employ a lightweight method that dynamically encodes identity-specific information into specialized identifier tokens. This design enables the model to learn distinct identity characteristics while maintaining robust generalization capabilities. We further enhance detection capabilities through a lightweight detection adapter that extracts fine-grained information from shallow features of the vision encoder, preserving critical low-level evidence. Comprehensive experiments demonstrate that our approach achieves 94.25% accuracy and 94.08% F1 score, outperforming both traditional forgery detectors and general VLMs while requiring only 10 extra tokens.
中文: 本文提出了一种新型个性化视觉语言模型,通过结合低级视觉伪影分析和高级语义不一致性检测,有效识别图像伪造,以最小计算开销实现了卓越性能。
English: This paper introduces a novel personalized vision-language model that combines low-level visual artifact analysis with high-level semantic inconsistency detection to effectively identify image forgeries, achieving superior performance with minimal computational overhead.

Authors:Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, Tat-Seng Chua
Title: SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
Abstract:
The rapid advancement of multi-modal large reasoning models (MLRMs) -- enhanced versions of multimodal language models (MLLMs) equipped with reasoning capabilities -- has revolutionized diverse applications. However, their safety implications remain underexplored. While prior work has exposed critical vulnerabilities in unimodal reasoning models, MLRMs introduce distinct risks from cross-modal reasoning pathways. This work presents the first systematic safety analysis of MLRMs through large-scale empirical studies comparing MLRMs with their base MLLMs. Our experiments reveal three critical findings: (1) The Reasoning Tax: Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. MLRMs exhibit 37.44% higher jailbreaking success rates than base MLLMs under adversarial attacks. (2) Safety Blind Spots: While safety degradation is pervasive, certain scenarios (e.g., Illegal Activity) suffer 25 times higher attack rates -- far exceeding the average 3.4 times increase, revealing scenario-specific vulnerabilities with alarming consistency across models and datasets. (3) Emergent Self-Correction: Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction -- 16.9% of jailbroken reasoning steps are overridden by safe answers, hinting at intrinsic safeguards. These findings underscore the urgency of scenario-aware safety auditing and mechanisms to amplify MLRMs' self-correction potential. To catalyze research, we open-source OpenSafeMLRM, the first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods. Our work calls for immediate efforts to harden reasoning-augmented AI, ensuring its transformative potential aligns with ethical safeguards.
中文: 该研究首次系统性揭示多模态大推理模型存在严重安全隐患,包括越狱成功率激增37.44%和特定场景风险激增25倍,同时发现其具备16.9%的自我修正能力,亟需建立场景化安全审计体系。
English: The study reveals that multi-modal large reasoning models (MLRMs) exhibit significantly higher safety vulnerabilities, including a 37.44% increase in jailbreaking success rates and scenario-specific risks, while also showing emergent self-correction capabilities, highlighting the urgent need for enhanced safety mechanisms.

Authors:Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li
Title: DreamFuse: Adaptive Image Fusion with Diffusion Transformer
Abstract:
Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
中文: 本研究提出DreamFuse方法,通过人机协同数据生成流程和位置仿射机制实现前景与背景的自适应交互,在保持图像协调融合的同时支持文本驱动编辑,各项指标均优于现有先进方法。
English: The study introduces DreamFuse, a diffusion-based method that uses a human-in-the-loop pipeline and Positional Affine mechanism to generate harmonious image fusions by enabling adaptive foreground-background interactions, outperforming existing techniques.

Authors:Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
Title: Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models
Abstract:
Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve simultaneously with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.
中文: 本研究揭示了长上下文模型中内在知识被低估的作用,表明其影响随上下文延长而增强,且外部检索能力的提升可能阻碍内在知识的运用,由此开发出双检索评估方法,其中Qwen-2.5模型表现优于Llama-3.1。
English: This study highlights the underappreciated role of intrinsic knowledge in long-context models, showing that its influence grows with context length and that enhanced extrinsic retrieval can hinder intrinsic knowledge use, leading to the development of a dual-retrieval evaluation method where Qwen-2.5 outperforms Llama-3.1.

Authors:Jinze Chen, Wei Zhai, Yang Cao, Bin Li, Zheng-Jun Zha
Title: Event Signal Filtering via Probability Flux Estimation
Abstract:
Events offer a novel paradigm for capturing scene dynamics via asynchronous sensing, but their inherent randomness often leads to degraded signal quality. Event signal filtering is thus essential for enhancing fidelity by reducing this internal randomness and ensuring consistent outputs across diverse acquisition conditions. Unlike traditional time series that rely on fixed temporal sampling to capture steady-state behaviors, events encode transient dynamics through polarity and event intervals, making signal modeling significantly more complex. To address this, the theoretical foundation of event generation is revisited through the lens of diffusion processes. The state and process information within events is modeled as continuous probability flux at threshold boundaries of the underlying irradiance diffusion. Building on this insight, a generative, online filtering framework called Event Density Flow Filter (EDFilter) is introduced. EDFilter estimates event correlation by reconstructing the continuous probability flux from discrete events using nonparametric kernel smoothing, and then resamples filtered events from this flux. To optimize fidelity over time, spatial and temporal kernels are employed in a time-varying optimization framework. A fast recursive solver with O(1) complexity is proposed, leveraging state-space models and lookup tables for efficient likelihood computation. Furthermore, a new real-world benchmark Rotary Event Dataset (RED) is released, offering microsecond-level ground truth irradiance for full-reference event filtering evaluation. Extensive experiments validate EDFilter's performance across tasks like event filtering, super-resolution, and direct event-based blob tracking. Significant gains in downstream applications such as SLAM and video reconstruction underscore its robustness and effectiveness.
中文: 该摘要提出EDFilter在线事件过滤框架,将事件生成建模为扩散过程并通过重建连续概率通量来降低随机性,实验验证其在SLAM和视频重建等应用中性能显著提升。
English: The abstract introduces EDFilter, an online event filtering framework that models event generation as a diffusion process and reconstructs continuous probability flux to reduce randomness, with experimental validation showing improved performance in applications like SLAM and video reconstruction.

Authors:Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong
Title: SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
Abstract:
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which compresses multi-view images into Gaussians via a UV-structured VAE and applies DiT-based conditional generation. This transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift and also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.
中文: 本文提出了一种潜在空间生成范式,通过高斯压缩和条件生成将三维人体数字化转化为可学习的分布偏移问题,同时构建了百万规模的HGS-1M数据集,实现了具有精细纹理和衣物变形的高质量三维人体重建。
English: This paper introduces a latent space generation paradigm that transforms 3D human digitization into a learnable distribution shift using Gaussian compression and conditional generation, while also creating a million-scale HGS-1M dataset to enable high-quality 3D human reconstruction with detailed textures and clothing deformation.

Authors:Boyang Zuo, Xiao Zhang, Feng Li, Pengjie Wang, Jian Xu, Bo Zheng
Title: VALUE: Value-Aware Large Language Model for Query Rewriting via Weighted Trie in Sponsored Search
Abstract:
In the realm of sponsored search advertising, matching advertisements with the search intent of a user's query is crucial. Query-to-bidwords (i.e., bidding keywords) rewriting is a vital technique that has garnered significant attention. Recently, with the prevalence of LLMs, generative retrieval methods have proven effective in producing high-relevance rewrites. However, we have identified a significant limitation in existing approaches: while fine-tuning LLMs for specific domains enhances semantic relevance, these models have no perception of the intrinsic value of their generated outputs, such as commercial value. Therefore, after SFT, an RLHF phase is often employed to address this issue. Nevertheless, traditional preference alignment methods often face challenges in aligning fine-grained values and are susceptible to overfitting, which diminishes the effectiveness and quality of the generated results. To address these challenges, we propose VALUE (Value-Aware Large language model for qUery rewriting via wEighted trie), the first framework that ensures the generation of high-value and highly relevant bidwords. Our approach utilizes a weighted trie, an innovative modification of the traditional trie data structure. By modulating the LLM's output probability distribution with value information from the trie during the decoding process, we constrain the generation space and guide the trajectory of text production. Offline experiments demonstrate the effectiveness of our method in semantic matching and preference alignment, showing a remarkable improvement in the value attribute by more than fivefold. Online A/B tests further revealed that our Revenue Per Mille (RPM) metric increased by 1.64%. VALUE has been deployed on our advertising system since October 2024 and served the Double Eleven promotions, the biggest shopping carnival in China.
中文: VALUE框架通过加权字典树结构在查询重写中引入价值感知指导,显著提升了竞价关键词的语义相关性和商业价值,已在广告系统中成功部署。
English: The VALUE framework enhances sponsored search advertising by integrating value-aware guidance through a weighted trie structure, significantly improving both semantic relevance and commercial performance of generated bidwords.
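As a rough illustration of trie-guided decoding, the sketch below restricts each step to tokens that continue some stored bidword and adds a value bonus to the model's log-probability. The additive `alpha * log(value)` rule, the `WeightedTrie` class, and the toy model are assumptions for illustration; the abstract does not specify how VALUE combines value with the output distribution.

```python
import math


class WeightedTrie:
    """Token-level trie; each node keeps the best value of any bidword below it."""

    def __init__(self):
        self.children = {}
        self.value = 0.0

    def insert(self, tokens, value):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, WeightedTrie())
            node.value = max(node.value, value)


def constrained_greedy_decode(trie, logprob_fn, alpha=1.0):
    """Greedy decoding limited to trie paths. The score of a candidate token is
    its model log-prob plus alpha * log(subtree value) (assumed combination rule).
    Decoding stops when a leaf is reached; values must be positive."""
    node, output = trie, []
    while node.children:
        logprobs = logprob_fn(output)  # dict: token -> log-probability
        best = max(
            node.children,
            key=lambda t: logprobs.get(t, -math.inf)
            + alpha * math.log(node.children[t].value),
        )
        output.append(best)
        node = node.children[best]
    return output


# Toy setup: two bidwords with hypothetical commercial values.
trie = WeightedTrie()
trie.insert(["running", "shoes"], 1.0)
trie.insert(["running", "socks"], 8.0)


def toy_model(prefix):
    # Hypothetical fixed distribution; "hat" is likely but not a valid bidword.
    return {"running": 0.0, "shoes": math.log(0.6),
            "socks": math.log(0.4), "hat": math.log(0.9)}


plain = constrained_greedy_decode(trie, toy_model, alpha=0.0)
valued = constrained_greedy_decode(trie, toy_model, alpha=1.0)
```

With `alpha=0.0` the decoder follows the model's preference inside the trie; with `alpha=1.0` the higher-value bidword wins, and "hat" is never produced because it is not on any trie path.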

Authors:Yizhe Tang, Zhimin Sun, Yuzhen Du, Ran Yi, Guangben Lu, Teng Hu, Luying Li, Lizhuang Ma, Fangyuan Zou
Title: A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting
Abstract:
Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, the "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (A$^\text{T}$A) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip A$^\text{T}$A with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our A$^\text{T}$A approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but also ensures good performance on subject-position fixed inpainting.
中文: 本文提出了"文本引导的主体位置可变背景修复"新任务,并开发了自适应变换代理(A$^\text{T}$A)方法,通过预测最佳位移实现主体与背景的和谐融合,在位置可变和固定两种场景下均展现出卓越性能。
English: This paper introduces a new task called Text-Guided Subject-Position Variable Background Inpainting and proposes the Adaptive Transformation Agent (A$^\text{T}$A) to dynamically adjust subject positions for harmonious background integration, demonstrating superior performance in both variable and fixed position scenarios.

Authors:Xingshan Zeng, Weiwen Liu, Xu Huang, Zezhong Wang, Lingzhi Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruiming Tang, Qun Liu
Title: ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning
Abstract:
Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, existing approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel framework that includes both model-aware iterative training and adaptive refinement for tool learning. ToolACE-R features a model-aware iterative training procedure that progressively adjusts training samples based on the model's evolving capabilities to maximize its potential. Additionally, it incorporates a self-refinement training corpus that emphasizes the LLM's ability to iteratively refine its tool calls, optimizing performance without requiring external feedback. Furthermore, we introduce an adaptive self-refinement mechanism for efficient test-time scaling, where the trained model can autonomously determine when to stop the process based on iterative self-refinement. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models. The performance of tool invocation can be further improved efficiently through adaptive self-refinement. These results highlight the effectiveness and generalizability of ToolACE-R, offering a promising direction for more efficient and scalable tool learning.
中文: 本文提出ToolACE-R框架,通过模型感知的迭代训练和自适应优化机制增强大语言模型的工具学习能力,在无需外部反馈的情况下实现高效性能提升和泛化能力。
English: This paper introduces ToolACE-R, a novel framework that enhances tool learning for Large Language Models through model-aware iterative training and adaptive self-refinement, achieving competitive performance and improved efficiency without external feedback.

Authors:Xinzhe Huang, Kedong Xiu, Tianhang Zheng, Churui Zeng, Wangze Ni, Zhan Qin, Kui Ren, Chun Chen
Title: DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization
Abstract:
Recent research has focused on exploring the vulnerabilities of Large Language Models (LLMs), aiming to elicit harmful and/or sensitive content from LLMs. However, due to the insufficient research on dual-jailbreaking -- attacks targeting both LLMs and Guardrails, the effectiveness of existing attacks is limited when attempting to bypass safety-aligned LLMs shielded by guardrails. Therefore, in this paper, we propose DualBreach, a target-driven framework for dual-jailbreaking. DualBreach employs a Target-driven Initialization (TDI) strategy to dynamically construct initial prompts, combined with a Multi-Target Optimization (MTO) method that utilizes approximate gradients to jointly adapt the prompts across guardrails and LLMs, which can simultaneously reduce the number of queries and achieve a high dual-jailbreaking success rate. For black-box guardrails, DualBreach either employs a powerful open-sourced guardrail or imitates the target black-box guardrail by training a proxy model, to incorporate guardrails into the MTO process. We demonstrate the effectiveness of DualBreach in dual-jailbreaking scenarios through extensive evaluation on several widely-used datasets. Experimental results indicate that DualBreach outperforms state-of-the-art methods with fewer queries, achieving significantly higher success rates across all settings. More specifically, DualBreach achieves an average dual-jailbreaking success rate of 93.67% against GPT-4 with Llama-Guard-3 protection, whereas the best success rate achieved by other methods is 88.33%. Moreover, DualBreach only uses an average of 1.77 queries per successful dual-jailbreak, outperforming other state-of-the-art methods. For the purpose of defense, we propose an XGBoost-based ensemble defensive mechanism named EGuard, which integrates the strengths of multiple guardrails, demonstrating superior performance compared with Llama-Guard-3.
中文: 近期研究提出DualBreach这一目标驱动框架,通过动态提示初始化与多目标优化方法,在少量查询下高效突破安全对齐大语言模型与防护机制的双重防御,其成功率显著超越现有技术。
English: Recent research has introduced DualBreach, a target-driven framework that employs dynamic prompt initialization and multi-target optimization to efficiently bypass both safety-aligned LLMs and guardrails, achieving higher success rates with fewer queries compared to existing methods.

Authors:Yui Lo, Yuqian Chen, Dongnan Liu, Leo Zekelman, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Fan Zhang, Weidong Cai, Lauren J. O'Donnell
Title: A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Abstract:
Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
中文: Tract2Shape是一种多模态深度学习框架,能高效准确地预测白质纤维束形状指标,相比现有方法在跨数据集测试中展现出卓越性能和强大泛化能力。
English: Tract2Shape is a multimodal deep learning framework that efficiently and accurately predicts white matter tractography shape measures, demonstrating superior performance and strong generalizability across datasets compared to existing methods.

Authors:Guyue Hu, Siyuan Song, Yukun Kang, Zhu Yin, Gangming Zhao, Chenglong Li, Jin Tang
Title: Federated Client-tailored Adapter for Medical Image Segmentation
Abstract:
Medical image segmentation in X-ray images is beneficial for computer-aided diagnosis and lesion localization. Existing methods mainly fall into a centralized learning paradigm, which is inapplicable in the practical medical scenario that only has access to distributed data islands. Federated Learning has the potential to offer a distributed solution but struggles with heavy training instability due to client-wise domain heterogeneity (including distribution diversity and class imbalance). In this paper, we propose a novel Federated Client-tailored Adapter (FCA) framework for medical image segmentation, which achieves stable and client-tailored adaptive segmentation without sharing sensitive local data. Specifically, the federated adapter stirs universal knowledge in off-the-shelf medical foundation models to stabilize the federated training process. In addition, we develop two client-tailored federated updating strategies that adaptively decompose the adapter into common and individual components, then globally and independently update the parameter groups associated with common client-invariant and individual client-specific units, respectively. They further stabilize the heterogeneous federated learning process and realize optimal client-tailored instead of sub-optimal global-compromised segmentation models. Extensive experiments on three large-scale datasets demonstrate the effectiveness and superiority of the proposed FCA framework for federated medical segmentation.
中文: 提出的联邦客户定制适配器(FCA)框架通过利用基础模型和客户端特定更新策略,在不共享敏感数据的情况下实现了稳定的分布式医学图像分割。
English: The proposed Federated Client-tailored Adapter (FCA) framework enables stable, distributed medical image segmentation by leveraging foundation models and client-specific updating strategies without sharing sensitive data.

Authors:Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Zheng Lin, Li Cao, Weiping Wang
Title: Dynamic Early Exit in Reasoning Models
Abstract:
Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.
中文: 本研究提出一种思维链自截断方法,通过动态监测模型在推理转折点的置信度实现早期终止,在11个前沿推理模型上平均缩短推理长度19.1%-80.1%的同时提升准确率0.3%-5.0%。
English: This study introduces a self-truncation method for large reasoning language models that dynamically halts chain-of-thought generation at high-confidence points, reducing reasoning length by 19.1%-80.1% while improving accuracy across multiple benchmarks.

Authors:Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, Weiping Wang
Title: Dynamic Early Exit in Reasoning Models
Abstract:
Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.
中文摘要:本研究提出一种思维链自截断方法,通过动态监测模型在推理转折点的置信度实现早期终止,在11个前沿推理模型上平均缩短推理长度19.1%-80.1%的同时提升准确率0.3%-5.0%。
English Summary: This study introduces a self-truncation method for large reasoning language models that dynamically halts chain-of-thought generation at high-confidence points, reducing reasoning length by 19.1%-80.1% while improving accuracy across multiple benchmarks.
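
The early-exit idea described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the helper names, the 0.9 threshold, and the choice of mean token probability of a trial answer as the confidence score are all assumptions made for the example. The sketch also assumes that the reasoning has already been split into chunks at transition points and that a trial answer has been scored at each one.

```python
import math
from typing import List, Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean token probability of a trial answer (hypothetical confidence score)."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def truncate_reasoning(
    chunks: List[str],
    trial_logprobs: List[Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep reasoning chunks up to the first transition point whose
    trial answer is confident enough, then stop (early exit)."""
    kept: List[str] = []
    for chunk, logprobs in zip(chunks, trial_logprobs):
        kept.append(chunk)
        if answer_confidence(logprobs) >= threshold:
            break  # confident trial answer: skip the remaining reasoning
    return kept
```

In a real decoder the trial answer would be generated and scored online at each transition point rather than supplied up front as it is here.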

Authors:Jinda Lu, Jinghan Li, Yuan Gao, Junkang Wu, Jiancan Wu, Xiang Wang, Xiangnan He
Title: AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization
Abstract:
Preference alignment through Direct Preference Optimization (DPO) has demonstrated significant effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods focus primarily on language preferences while neglecting the critical visual context. In this paper, we propose an Adaptive Vision-enhanced Preference optimization (AdaViP) that addresses these limitations through two key innovations: (1) vision-based preference pair construction, which integrates multiple visual foundation models to strategically remove key visual elements from the image, enhancing MLLMs' sensitivity to visual details; and (2) adaptive preference optimization that dynamically balances vision- and language-based preferences for more accurate alignment. Extensive evaluations across different benchmarks demonstrate the effectiveness of our approach. Notably, our AdaViP-7B achieves 93.7% and 96.4% reductions in response-level and mentioned-level hallucination respectively on the Object HalBench, significantly outperforming current state-of-the-art methods.
中文: 本文提出的AdaViP方法通过视觉偏好构建和自适应优化策略,增强了多模态大语言模型对视觉细节的敏感性,在降低幻觉方面表现卓越,显著优于现有最优方法。
English: This paper introduces AdaViP, an adaptive vision-enhanced preference optimization method that improves multimodal large language models by integrating vision-based preference construction and dynamic preference balancing, significantly reducing hallucinations and outperforming existing approaches.
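
For reference, the standard DPO objective that this line of work builds on can be written as below. This is the generic formulation, not AdaViP's adaptive vision/language variant, which the abstract does not spell out. Here $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ is the logistic function, and $\beta$ is a temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```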

Authors:Moyang Liu, Kaiying Yan, Yukun Liu, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li
Title: Deconfounded Reasoning for Multimodal Fake News Detection via Causal Intervention
Abstract:
The rapid growth of social media has led to the widespread dissemination of fake news across multiple content forms, including text, images, audio, and video. Traditional unimodal detection methods fall short in addressing complex cross-modal manipulations; as a result, multimodal fake news detection has emerged as a more effective solution. However, existing multimodal approaches, especially in the context of fake news detection on social media, often overlook the confounders hidden within complex cross-modal interactions, leading models to rely on spurious statistical correlations rather than genuine causal mechanisms. In this paper, we propose the Causal Intervention-based Multimodal Deconfounded Detection (CIMDD) framework, which systematically models three types of confounders via a unified Structural Causal Model (SCM): (1) Lexical Semantic Confounder (LSC); (2) Latent Visual Confounder (LVC); (3) Dynamic Cross-Modal Coupling Confounder (DCCC). To mitigate the influence of these confounders, we specifically design three causal modules based on backdoor adjustment, frontdoor adjustment, and cross-modal joint intervention to block spurious correlations from different perspectives and achieve causal disentanglement of representations for deconfounded reasoning. Experimental results on the FakeSV and FVC datasets demonstrate that CIMDD significantly improves detection accuracy, outperforming state-of-the-art methods by 4.27% and 4.80%, respectively. Furthermore, extensive experimental results indicate that CIMDD exhibits strong generalization and robustness across diverse multimodal scenarios.
中文: CIMDD框架通过因果干预技术解决多模态虚假新闻检测中的混杂因素,相比现有方法显著提升了检测准确性与鲁棒性。
English: The CIMDD framework addresses confounders in multimodal fake news detection through causal intervention techniques, significantly improving accuracy and robustness over existing methods.

Authors:Moyang Liu, Kaiying Yan, Yukun Liu, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li
Title: Exploring Modality Disruption in Multimodal Fake News Detection
Abstract:
The rapid growth of social media has led to the widespread dissemination of fake news across multiple content forms, including text, images, audio, and video. Compared to unimodal fake news detection, multimodal fake news detection benefits from the increased availability of information across multiple modalities. However, in the context of social media, certain modalities in multimodal fake news detection tasks may contain disruptive or over-expressive information. These elements often include exaggerated or embellished content. We define this phenomenon as modality disruption and explore its impact on detection models through experiments. To address the issue of modality disruption in a targeted manner, we propose a multimodal fake news detection framework, FND-MoE. Additionally, we design a two-pass feature selection mechanism to further mitigate the impact of modality disruption. Extensive experiments on the FakeSV and FVC-2018 datasets demonstrate that FND-MoE significantly outperforms state-of-the-art methods, with accuracy improvements of 3.45% and 3.71% on the respective datasets compared to baseline models.
中文摘要:本研究针对多模态虚假新闻检测中的模态干扰问题,提出FND-MoE框架及双通道特征选择机制,在基准数据集上相比基线模型实现了显著准确率提升。
English Summary: The study addresses modality disruption in multimodal fake news detection by proposing the FND-MoE framework with a two-pass feature selection mechanism, achieving significant accuracy improvements over baseline models on benchmark datasets.

Authors:Zexu Wang, Jiachi Chen, Tao Zhang, Yu Zhang, Weizhe Zhang, Yuming Feng, Zibin Zheng
Title: Copy-and-Paste? Identifying EVM-Inequivalent Code Smells in Multi-chain Reuse Contracts
Abstract:
With the growth of Solidity contracts on Ethereum, more developers are reusing them on other compatible blockchains. However, developers may overlook differences between blockchain system designs, such as the Gas Mechanism and Consensus Protocol, so that the same contracts on different blockchains fail to execute as consistently as they do on Ethereum. This inconsistency reveals design flaws in reused contracts, exposing code smells that hinder code reusability, and we define this inconsistency as EVM-Inequivalent Code Smells. In this paper, we conducted the first empirical study to reveal the causes and characteristics of EVM-Inequivalent Code Smells. To ensure the identified smells reflect real developer concerns, we collected and analyzed 1,379 security audit reports and 326 Stack Overflow posts related to reused contracts on EVM-compatible blockchains, such as Binance Smart Chain (BSC) and Polygon. Using the open card sorting method, we defined six types of EVM-Inequivalent Code Smells. For automated detection, we developed a tool named EquivGuard. It employs static taint analysis to identify key paths from different patterns and uses symbolic execution to verify path reachability. Our analysis of 905,948 contracts across six major blockchains shows that EVM-Inequivalent Code Smells are widespread, with an average prevalence of 17.70%. While contracts with code smells do not necessarily lead to financial loss and attacks, their high prevalence and significant asset management underscore the potential threats of reusing these smelly Ethereum contracts. Thus, developers are advised to abandon Copy-and-Paste programming practices and detect EVM-Inequivalent Code Smells before reusing Ethereum contracts.
中文: 本研究揭示了以太坊合约在其他区块链上重用因系统设计差异导致的EVM不等价代码异味,并开发了自动检测工具EquivGuard,发现其广泛存在且具有潜在威胁。
English: This study identifies EVM-Inequivalent Code Smells, inconsistencies arising when Ethereum contracts are reused on other blockchains due to system design differences, and develops EquivGuard for automated detection, revealing their widespread prevalence and potential threats.

Authors:Zhentian Zhang, Kai-Kit Wong, Jian Dang, Zaichen Zhang, Chan-Byoung Chae
Title: On Fundamental Limits for Fluid Antenna-assisted Integrated Sensing and Communications for Unsourced Random Access
Abstract:
This paper investigates the unsourced random access (URA) problem for integrated sensing and communications (ISAC). Recent results reveal that conventional multiple access strategies for ISAC such as treating interference as noise (TIN) and time-division multiple access (TDMA) can be easily overwhelmed and fail to support the surging number of active users. Hence, the unsourced ISAC (UNISAC) system model has emerged as a promising enabler for future ISAC networks. To advance this work, we adopt a more realistic channel model and propose to utilize fluid antenna system (FAS) for UNISAC. The achievable performance bound and floor of the proposed FAS-UNISAC are derived to validate the great potential. Our results demonstrate that promising improvement on the available user volume and the sensing and communication capability can be obtained due to the spatial diversities inherent within fluid antenna.
中文: 本文提出了一种用于无源集成感知与通信(UNISAC)的流体天线系统(FAS),通过利用空间分集,验证了其在提升用户容量和通信感知能力方面的巨大潜力。
English: This paper introduces a fluid antenna system (FAS) for unsourced integrated sensing and communication (UNISAC), demonstrating its potential to significantly enhance user capacity and performance by leveraging spatial diversity.

Authors:Jianyu Wu, Yizhou Wang, Xiangyu Yue, Xinzhu Ma, Jingyang Guo, Dongzhan Zhou, Wanli Ouyang, Shixiang Tang
Title: CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Abstract:
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC.
中文: 本文提出了一个多模态CAD生成框架CMT,能够有效捕捉B-Rep的关键先验知识,并创建了大规模数据集mmABC,在条件和非条件CAD生成任务中均表现出优越性能。
English: This paper introduces a multimodal CAD generation framework called CMT that captures essential B-Rep priors and a large-scale dataset mmABC, demonstrating superior performance in both conditional and unconditional CAD generation tasks.

Authors:Anubhav Jain, Yuya Kobayashi, Naoki Murata, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji, Niv Cohen, Nasir Memon, Julian Togelius
Title: Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image
Abstract:
Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diffusion model weights. Our attack uses only a single watermarked example and is based on a simple observation: there is a many-to-one mapping between images and initial noises. There are regions in the clean image latent space pertaining to each watermark that get mapped to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images. We show that we can also apply a similar approach for watermark removal by learning perturbations to exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in the watermarking methods, motivating future research on improving them.
中文摘要:本文提出一种黑盒对抗攻击方法,通过扰动图像来操控其与初始噪声的映射关系,从而实现水印伪造或去除,揭示了现有水印方案存在的安全漏洞。
English Summary: This paper introduces a black-box adversarial attack that forges or removes watermarks in diffusion models by perturbing images to manipulate their mapping to initial noise, revealing vulnerabilities in current watermarking schemes.

Authors:Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He
Title: Marmot: Object-Level Self-Correction via Multi-Agent Reasoning
Abstract:
While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. One potential solution involves employing Multimodal Large Language Model (MLLM) as an AI agent to construct a self-correction framework. However, these approaches heavily rely on the capabilities of the MLLMs used, often fail to account for all objects within the image, and suffer from cumulative distortions during multi-round editing processes. To address these challenges, we propose Marmot, a novel and generalizable framework that leverages Multi-Agent Reasoning for Multi-Object Self-Correcting to enhance image-text alignment. First, we employ a large language model as an Object-Aware Agent to perform object-level divide-and-conquer, automatically decomposing self-correction tasks into object-centric subtasks based on image descriptions. For each subtask, we construct an Object Correction System featuring a decision-execution-verification mechanism that operates exclusively on a single object's segmentation mask or the bounding boxes of object pairs, effectively mitigating inter-object interference and enhancing editing reliability. To efficiently integrate correction results from subtasks while avoiding cumulative distortions from multi-stage editing, we propose a Pixel-Domain Stitching Smoother, which employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtasks, significantly improving runtime efficiency while preventing distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
中文摘要:提出的Marmot框架通过多智能体推理和对象级任务分解,解决了扩散模型在多对象场景处理中的不足,采用并行子任务处理和像素域优化技术,显著提升了对象计数、属性分配和空间关系的准确性。
English Summary: The proposed Marmot framework addresses diffusion models' limitations in handling multi-object scenes by employing multi-agent reasoning and object-level task decomposition, achieving significant improvements in counting, attributes, and spatial relationships through parallel subtask processing and pixel-domain optimization.

Authors:Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu
Title: m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training
Abstract:
Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
中文: 本文提出了一种知识驱动的多智能体框架,用于生物医学语料精馏,能自主生成高质量问答对,使Llama3-70B等大语言模型在生物医学问答任务中超越先进专有模型。
English: This paper introduces a knowledge-driven, multi-agent framework for biomedical corpus distillation that autonomously generates high-quality question-answer pairs, enabling large language models like Llama3-70B to surpass advanced proprietary models in biomedical question-answering tasks.

Authors:Yifan Duan, Heng Li, Yilong Wu, Wenhao Yu, Xinran Zhang, Yedong Shen, Jianmin Ji, Yanyong Zhang
Title: STDArm: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation
Abstract:
Recent advances in mobile robotic platforms like quadruped robots and drones have spurred a demand for deploying visuomotor policies in increasingly dynamic environments. However, the collection of high-quality training data, the impact of platform motion and processing delays, and limited onboard computing resources pose significant barriers to existing solutions. In this work, we present STDArm, a system that directly transfers policies trained under static conditions to dynamic platforms without extensive modifications. The core of STDArm is a real-time action correction framework consisting of: (1) an action manager to boost control frequency and maintain temporal consistency, (2) a stabilizer with a lightweight prediction network to compensate for motion disturbances, and (3) an online latency estimation module for calibrating system parameters. In this way, STDArm achieves centimeter-level precision in mobile manipulation tasks. We conduct comprehensive evaluations of the proposed STDArm on two types of robotic arms, four types of mobile platforms, and three tasks. Experimental results indicate that the STDArm enables real-time compensation for platform motion disturbances while preserving the original policy's manipulation capabilities, achieving centimeter-level operational precision during robot motion.
中文摘要:STDArm系统通过实时动作校正框架,将静态条件下训练的策略直接迁移到动态移动平台,无需大量修改即可在机器人运动过程中实现厘米级操作精度。
English Summary: STDArm enables direct transfer of static-trained visuomotor policies to dynamic mobile platforms through real-time action correction, achieving centimeter-level precision in manipulation tasks without extensive retraining.

Authors:Qingqing Ye, Liantong Yu, Kai Huang, Xiaokui Xiao, Weiran Liu, Haibo Hu
Title: From Randomized Response to Randomized Index: Answering Subset Counting Queries with Local Differential Privacy
Abstract:
Local Differential Privacy (LDP) is the predominant privacy model for safeguarding individual data privacy. Existing perturbation mechanisms typically require perturbing the original values to ensure acceptable privacy, which inevitably results in value distortion and utility deterioration. In this work, we propose an alternative approach -- instead of perturbing values, we apply randomization to indexes of values while ensuring rigorous LDP guarantees. Inspired by the deniability of randomized indexes, we present CRIAD for answering subset counting queries on set-value data. By integrating a multi-dummy, multi-sample, and multi-group strategy, CRIAD serves as a fully scalable solution that offers flexibility across various privacy requirements and domain sizes, and achieves more accurate query results than any existing methods. Through comprehensive theoretical analysis and extensive experimental evaluations, we validate the effectiveness of CRIAD and demonstrate its superiority over traditional value-perturbation mechanisms.
中文摘要:本文提出CRIAD这一新型本地差分隐私方法,通过随机化数据索引而非扰动数值来提升集合计数查询的准确性,同时保持严格的隐私保护。
English Summary: This paper introduces CRIAD, a novel Local Differential Privacy method that randomizes data indexes rather than perturbing values to enhance accuracy in subset counting queries while maintaining strong privacy guarantees.
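
As background for the contrast drawn in this abstract, the classic binary randomized response mechanism (the value-perturbation baseline the title alludes to) can be sketched as follows. This is not CRIAD: the function names are hypothetical, and the sketch only shows how perturbing values under epsilon-LDP leads to a debiased but noisy count.

```python
import math
import random
from typing import List


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-LDP (epsilon > 0)."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability p, otherwise report the flipped bit."""
    p = keep_probability(epsilon)
    return bit if rng.random() < p else 1 - bit


def estimate_count(reports: List[int], epsilon: float) -> float:
    """Unbiased estimate of how many users truly hold bit 1.

    Uses E[sum(reports)] = c * (2p - 1) + n * (1 - p), solved for c.
    """
    p = keep_probability(epsilon)
    n = len(reports)
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)
```

The estimator divides by 2p - 1, so its variance grows quickly as epsilon shrinks; this is the kind of utility loss from value distortion that the abstract describes.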

Authors:Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao
Title: Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning
Abstract:
Feature selection aims to preprocess the target dataset, find an optimal and most streamlined feature subset, and enhance the downstream machine learning task. Among filter, wrapper, and embedded-based approaches, the reinforcement learning (RL)-based subspace exploration strategy provides a novel objective optimization-directed perspective and promising performance. Nevertheless, even with improved performance, current reinforcement learning approaches face challenges similar to conventional methods when dealing with complex datasets. These challenges stem from the inefficient paradigm of using one agent per feature and the inherent complexities present in the datasets. This observation motivates us to investigate and address the above issue and propose a novel approach, namely HRLFS. Our methodology initially employs a Large Language Model (LLM)-based hybrid state extractor to capture each feature's mathematical and semantic characteristics. Based on this information, features are clustered, facilitating the construction of hierarchical agents for each cluster and sub-cluster. Extensive experiments demonstrate the efficiency, scalability, and robustness of our approach. Compared to contemporary methods and one-feature-one-agent RL-based approaches, HRLFS improves the downstream ML performance with iterative feature subspace exploration while accelerating total run time by reducing the number of agents involved.
中文摘要:特征选择通过HRLFS新方法,利用大型语言模型聚类特征并构建分层代理,相比传统强化学习方法减少了代理数量,从而提升机器学习性能与效率。
English Summary: HRLFS is a feature selection approach that uses an LLM-based state extractor to cluster features and builds hierarchical agents for each cluster, improving downstream machine learning performance and run time by using fewer agents than one-agent-per-feature reinforcement learning methods.

Authors:Xiaohan Huang, Dongjie Wang, Zhiyuan Ning, Ziyue Qiao, Qingqing Long, Haowei Zhu, Yi Du, Min Wu, Yuanchun Zhou, Meng Xiao
Title: Collaborative Multi-Agent Reinforcement Learning for Automated Feature Transformation with Graph-Driven Path Optimization
Abstract:
Feature transformation methods aim to find an optimal mathematical feature-feature crossing process that generates high-value features and improves the performance of downstream machine learning tasks. Existing frameworks, though designed to mitigate manual costs, often treat feature transformations as isolated operations, ignoring dynamic dependencies between transformation steps. To address the limitations, we propose TCTO, a collaborative multi-agent reinforcement learning framework that automates feature engineering through graph-driven path optimization. The framework's core innovation lies in an evolving interaction graph that models features as nodes and transformations as edges. Through graph pruning and backtracking, it dynamically eliminates low-impact edges, reduces redundant operations, and enhances exploration stability. This graph also provides full traceability to empower TCTO to reuse high-utility subgraphs from historical transformations. To demonstrate the efficacy and adaptability of our approach, we conduct comprehensive experiments and case studies, which show superior performance across a range of datasets.
中文摘要:提出的TCTO框架采用多智能体强化学习和动态演化的交互图,通过优化特征转换路径、消除冗余操作并复用高价值子图来自动化特征工程,实验证明其具有优越性能。
English Summary: The proposed TCTO framework employs multi-agent reinforcement learning with an evolving interaction graph to automate feature engineering by dynamically optimizing transformation paths, eliminating redundancies, and reusing high-value subgraphs, demonstrating superior performance in experiments.

Authors:Chiung-Yi Tseng, Junhao Song, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Ming Liu
Title: Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement
Abstract:
In the era of data-driven intelligence, the paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in the advancement of machine learning. This paper gives a detailed overview of Active Learning (AL), which is a strategy in machine learning that helps models achieve better performance using fewer labeled examples. It introduces the basic concepts of AL and discusses how it is used in various fields such as computer vision, natural language processing, transfer learning, and real-world applications. The paper focuses on important research topics such as uncertainty estimation, handling of class imbalance, domain adaptation, fairness, and the creation of strong evaluation metrics and benchmarks. It also shows that learning methods inspired by humans and guided by questions can improve data efficiency and help models learn more effectively. In addition, this paper talks about current challenges in the field, including the need to rebuild trust, ensure reproducibility, and deal with inconsistent methodologies. It points out that AL often gives better results than passive learning, especially when good evaluation measures are used. This work aims to be useful for both researchers and practitioners by providing key insights and proposing directions for future progress in active learning.
中文摘要:本文全面综述了主动学习,阐述了其如何利用较少标注样本提升机器学习性能,并探讨了关键研究方向及当前领域挑战。
English Summary: This paper provides a comprehensive overview of Active Learning, highlighting its ability to enhance machine learning performance with fewer labeled examples while addressing key research areas and current challenges in the field.
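
To make the uncertainty-estimation theme of this survey concrete, here is a minimal sketch of entropy-based uncertainty sampling, one common active learning query strategy. The abstract does not single out a specific strategy, so the choice of entropy and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs: List[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k unlabeled examples with the highest predictive
    entropy, i.e. the ones the current model is least sure about."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The selected indices would then be sent to an annotator, added to the labeled set, and the model retrained before the next query round.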

Authors:Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, Ury Zhilinsky
Title: $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Abstract:
In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $π_{0.5}$, a new model based on $π_{0}$ that uses co-training on heterogeneous tasks to enable broad generalization. $π_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
中文:$π_{0.5}$模型通过多源数据协同训练,实现了机器人操作能力的广泛泛化,首次在全新家庭环境中完成厨房清洁等复杂长周期任务。
English: The $π_{0.5}$ model enhances robotic manipulation by co-training on diverse data sources, enabling unprecedented generalization to perform complex tasks like kitchen cleaning in novel environments.

Authors:Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
Title: Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world, thereby limiting their broader applications. We argue that spatial reasoning capabilities will not naturally emerge from merely scaling existing architectures and training methodologies. Instead, this challenge demands dedicated attention to fundamental modifications in the current MLLM development approach. In this position paper, we first establish a comprehensive framework for spatial reasoning within the context of MLLMs. We then elaborate on its pivotal role in real-world applications. Through systematic analysis, we examine how individual components of the current methodology, from training data to reasoning mechanisms, influence spatial reasoning capabilities. This examination reveals critical limitations while simultaneously identifying promising avenues for advancement. Our work aims to direct the AI research community's attention toward these crucial yet underexplored aspects. By highlighting these challenges and opportunities, we seek to catalyze progress toward achieving human-like spatial reasoning capabilities in MLLMs.
中文: 多模态大语言模型在空间推理方面存在显著不足,这限制了其实际应用,需要从根本上改变开发方法,而非简单扩展。
English: Multimodal Large Language Models exhibit significant limitations in spatial reasoning, which restricts their real-world applications, necessitating fundamental changes in development approaches rather than simple scaling.

Authors:Rong Du, Qingqing Ye, Yaxin Xiao, Liantong Yu, Yue Fu, Haibo Hu
Title: Dual Utilization of Perturbation for Stream Data Publication under Local Differential Privacy
Abstract:
Stream data from real-time distributed systems such as IoT, tele-health, and crowdsourcing has become an important data source. However, the collection and analysis of user-generated stream data raise privacy concerns due to the potential exposure of sensitive information. To address these concerns, local differential privacy (LDP) has emerged as a promising standard. Nevertheless, applying LDP to stream data presents significant challenges, as stream data often involves a large or even infinite number of values. Allocating a given privacy budget across these data points would introduce overwhelming LDP noise to the original stream data. Beyond existing approaches that merely use perturbed values for estimating statistics, our design leverages them for both perturbation and estimation. This dual utilization arises from a key observation: each user knows their own ground truth and perturbed values, enabling a precise computation of the deviation error caused by perturbation. By incorporating this deviation into the perturbation process of subsequent values, the previous noise can be calibrated. Following this insight, we introduce the Iterative Perturbation Parameterization (IPP) method, which utilizes current perturbed results to calibrate the subsequent perturbation process. To enhance the robustness of calibration and reduce sensitivity, two algorithms, namely Accumulated Perturbation Parameterization (APP) and Clipped Accumulated Perturbation Parameterization (CAPP) are further developed. We prove that these three algorithms satisfy $w$-event differential privacy while significantly improving utility. Experimental results demonstrate that our techniques outperform state-of-the-art LDP stream publishing solutions in terms of utility, while retaining the same privacy guarantee.
中文: 提出的迭代扰动参数化方法利用用户对自身数据偏差的认知来校准流数据中的噪声,在保持w-事件差分隐私的同时,相比现有方法实现了更优的数据效用。
English: The proposed Iterative Perturbation Parameterization method leverages each user's knowledge of their own perturbation deviation to calibrate noise in subsequent stream values, achieving better utility than existing approaches while maintaining w-event differential privacy.

Authors:Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu
Title: Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Abstract:
Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.
中文: 知识蒸馏通过将复杂教师模型的知识迁移至精简学生模型,有效提升模型效率和精度,其最新进展在优化性能与模型压缩方面展现出显著成效。
English: Knowledge distillation effectively transfers knowledge from complex teacher models to simpler student models, enhancing efficiency and accuracy across various AI applications, with recent innovations improving performance and model compression.

Authors:Yulian Mao, Qingqing Ye, Rong Du, Qi Wang, Kai Huang, Haibo Hu
Title: Multi-class Item Mining under Local Differential Privacy
Abstract:
Item mining, a fundamental task for collecting statistical data from users, has raised increasing privacy concerns. To address these concerns, local differential privacy (LDP) was proposed as a privacy-preserving technique. Existing LDP item mining mechanisms primarily concentrate on global statistics, i.e., those from the entire dataset. Nevertheless, they fall short of user-tailored tasks such as personalized recommendations, whereas classwise statistics can improve task accuracy with fine-grained information. Meanwhile, the introduction of class labels brings new challenges. Label perturbation may result in invalid items for aggregation. To this end, we propose frameworks for multi-class item mining, along with two mechanisms: validity perturbation to reduce the impact of invalid data, and correlated perturbation to preserve the relationship between labels and items. We also apply these optimized methods to two multi-class item mining queries: frequency estimation and top-$k$ item mining. Through theoretical analysis and extensive experiments, we verify the effectiveness and superiority of these methods.
中文: 本文提出了本地差分隐私下的多类别项目挖掘新框架,通过有效性扰动和关联扰动机制解决标签扰动带来的挑战,在频率估计和Top-k挖掘等任务中验证了方法的优越性。
English: This paper introduces novel frameworks for multi-class item mining under local differential privacy, featuring validity and correlated perturbation mechanisms to enhance accuracy in tasks like frequency estimation and top-k mining while addressing challenges from label perturbation.

Authors:Shumin Wang, Zhuoran Yang, Lidian Wang, Zhipeng Tang, Heng Li, Lehan Pan, Sha Zhang, Jie Peng, Jianmin Ji, Yanyong Zhang
Title: Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving
Abstract:
The significant achievements of pre-trained models leveraging large volumes of data in the field of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
中文: 本文提出一种利用海量未标注数据进行自监督预训练的框架,通过结合提示适配器的领域适应策略,显著提升了自动驾驶3D感知模型在下游任务中的性能,并展现出随数据规模扩大而持续优化的潜力。
English: This paper introduces a self-supervised pre-training framework using massive unlabeled data to enhance 3D perception models for autonomous driving, achieving significant improvements in downstream tasks and demonstrating scalable performance with increasing data volume.

Authors:Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji
Title: SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing
Abstract:
Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models by involving forward-backward diffusion processes for editing. However, these methods often struggle to maintain the music content consistency. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows for the editing of music into any user-defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Audio examples are available on https://steermusic.pages.dev/.
中文: 本文提出SteerMusic和SteerMusic+两种音乐编辑方法,通过分数蒸馏技术有效保持原始音乐内容一致性,其中后者还能实现超越文本描述的个性化风格编辑,实验证明其编辑质量显著优于现有方法。
English: This paper introduces SteerMusic and SteerMusic+, two novel music editing methods that utilize score distillation to better preserve content consistency and achieve superior editing fidelity compared to existing approaches, with the latter enabling personalized style manipulation beyond text instructions.

Authors:Osvaldo Simeone, Sangwoo Park, Matteo Zecchin
Title: Conformal Calibration: Ensuring the Reliability of Black-Box AI in Wireless Systems
Abstract:
AI is poised to revolutionize telecommunication networks by boosting efficiency, automation, and decision-making. However, the black-box nature of most AI models introduces substantial risk, possibly deterring adoption by network operators. These risks are not addressed by the current prevailing deployment strategy, which typically follows a best-effort train-and-deploy paradigm. This paper reviews conformal calibration, a general framework that moves beyond the state of the art by adopting computationally lightweight, advanced statistical tools that offer formal reliability guarantees without requiring further training or fine-tuning. Conformal calibration encompasses pre-deployment calibration via uncertainty quantification or hyperparameter selection; online monitoring to detect and mitigate failures in real time; and counterfactual post-deployment performance analysis to address "what if" diagnostic questions after deployment. By weaving conformal calibration into the AI model lifecycle, network operators can establish confidence in black-box AI models as a dependable enabling technology for wireless systems.
中文: 保形校准通过部署前标定、实时监测和部署后分析,为电信领域中的AI模型提供轻量级统计框架和正式可靠性保证,从而解决黑箱模型风险并建立运营商信任。
English: Conformal calibration offers a lightweight statistical framework that provides formal reliability guarantees for AI models in telecommunications, addressing risks through pre-deployment calibration, real-time monitoring, and post-deployment analysis to build operator confidence.
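
The pre-deployment calibration step mentioned above can be illustrated with split conformal prediction, one standard instance of conformal methods. This is a minimal sketch and not the procedure from the paper: the function names `conformal_threshold` and `prediction_set`, the score values, and the choice of nonconformity score are all assumptions made for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    nonconformity score measured on a held-out calibration set."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # (all-inclusive) prediction set gives the coverage guarantee.
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(scores_by_label, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, score in scores_by_label.items() if score <= threshold}

# Hypothetical calibration scores from a black-box model (illustrative values only).
cal_scores = [0.7, 0.1, 0.5, 0.3, 0.2, 0.6, 0.4]
threshold = conformal_threshold(cal_scores, alpha=0.25)
candidates = prediction_set({"a": 0.15, "b": 0.55, "c": 0.90}, threshold)
```

No retraining of the underlying model is involved: the threshold is computed from held-out scores alone, which is why such calibration is lightweight.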

Authors:Jiawei Duan, Haibo Hu, Qingqing Ye, Xinyue Sun
Title: Technical Report: Full Version of Analyzing and Optimizing Perturbation of DP-SGD Geometrically
Abstract:
Differential privacy (DP) has become a prevalent privacy model in a wide range of machine learning tasks, especially after the debut of DP-SGD. However, DP-SGD, which directly perturbs gradients in the training iterations, fails to mitigate the negative impacts of noise on gradient direction. As a result, DP-SGD is often inefficient. Although various solutions (e.g., clipping to reduce the sensitivity of gradients and amplifying privacy bounds to save privacy budgets) are proposed to trade privacy for model efficiency, the root cause of its inefficiency is yet unveiled. In this work, we first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on direction has eminent impact on the model efficiency while that on magnitude can be mitigated by optimization techniques, i.e., fine-tuning gradient clipping and learning rate. Besides, we confirm that traditional DP introduces biased noise on the direction when adding unbiased noise to the gradient itself. Overall, the perturbation of DP-SGD is actually sub-optimal from a geometric perspective. Motivated by this, we design a geometric perturbation strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient, respectively. By directly reducing the noise on the direction, GeoDP mitigates the negative impact of DP noise on model efficiency with the same DP guarantee. Extensive experiments on two public datasets (i.e., MNIST and CIFAR-10), one synthetic dataset and three prevalent models (i.e., Logistic Regression, CNN and ResNet) confirm the effectiveness and generality of our strategy.
中文: DP-SGD中的差分隐私机制因对梯度方向引入噪声而导致模型效率低下,为此提出的GeoDP方法通过分别扰动梯度的方向和幅度,在保证隐私的同时有效提升了模型性能。
English: Differential privacy in DP-SGD introduces inefficient noise primarily affecting gradient direction, prompting the development of GeoDP, which separately perturbs direction and magnitude to enhance model efficiency while maintaining privacy guarantees.

Authors:Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs
Title: Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs
Abstract:
Multi-head attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives most attention despite limited semantic importance, challenge our understanding of multi-head attention. To analyze this phenomenon, we propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. We compare our definition to prior work in a model intervention study where we test whether dormant heads matter for inference by zeroing out the output of dormant attention heads. Using six pretrained models and five benchmark datasets, we find our definition to be more model and dataset-agnostic. Using our definition on most models, more than 4% of a model's attention heads can be zeroed while maintaining average accuracy, and zeroing more than 14% of a model's attention heads can keep accuracy to within 1% of the pretrained model's average accuracy. Further analysis reveals that dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining. Additionally, we provide evidence that they depend on characteristics of the input text.
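中文: 本文将被注意力汇聚主导的注意力头定义为休眠注意力头,并通过将其输出置零来检验其对推理的影响;在六个预训练模型和五个基准数据集上,该定义更具模型与数据集无关性,多数模型中置零超过4%的注意力头仍可保持平均精度,置零超过14%时精度仍可保持在预训练模型平均精度的1%以内。
English: This paper defines dormant attention heads as heads dominated by attention sinks and tests their importance by zeroing their outputs; across six pretrained models and five benchmark datasets, the definition is more model- and dataset-agnostic than prior work, and on most models more than 4% of heads can be zeroed while maintaining average accuracy and more than 14% while staying within 1% of the pretrained model's average accuracy.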

Authors:Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs
Title: Identifying and Evaluating Inactive Heads in Pretrained LLMs
Abstract:
Attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives the most attention despite limited semantic importance, suggest some heads may be inactive, and point to a significant source of computational redundancy. To analyze this phenomenon, we propose a taxonomy of 13 score functions that measure different ways a head can be inactive. Thresholding these scores allows us to analyze different sets of potentially inactive attention heads. We evaluate whether identified heads are inactive through model interventions, finding that more than 12% of attention heads are inactive on average, and can be ablated in specific contexts while maintaining MMLU accuracy to within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions that rely solely on attention weights. We establish that relying on a score function that measures a first token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, large model scales present markedly different attention behaviors.
中文: 大语言模型中的注意力汇聚现象导致计算冗余,研究发现超过12%的注意力头处于非活跃状态但可被移除而不影响精度,新提出的评分函数比传统基于注意力的方法更能有效识别这些头。
English: Attention sinks in large language models lead to computational redundancy, with over 12% of heads being inactive yet removable while preserving accuracy, as revealed by novel score functions that outperform traditional attention-based metrics.
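
The first-token attention-sink score mentioned in the abstract can be sketched as follows. This is a simplified illustration rather than any of the paper's 13 score functions as published: the names `first_token_score` and `flag_heads`, the averaging over query positions, and the threshold of 0.9 are all assumptions.

```python
def first_token_score(attn):
    """Average attention weight that query positions place on key position 0.

    attn is a list of rows, one per query position, each a list of
    attention weights over key positions.
    """
    return sum(row[0] for row in attn) / len(attn)

def flag_heads(attn_by_head, threshold=0.9):
    """Return indices of heads whose first-token score exceeds the
    (hypothetical) threshold; these are candidates for ablation."""
    return [h for h, attn in enumerate(attn_by_head)
            if first_token_score(attn) > threshold]

# Two toy causal attention maps over three tokens (illustrative values only).
sink_head = [[1.0, 0.0, 0.0], [0.95, 0.05, 0.0], [0.9, 0.05, 0.05]]
active_head = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
flagged = flag_heads([sink_head, active_head])
```

As the abstract notes, a score of this kind relies solely on attention weights and can miss inactive heads that output-norm scores would identify.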

Authors:Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
Title: MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Abstract:
Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.
中文: 本文提出一个综合评估框架,通过标准化传统任务和引入新型混合模态生成测试来解决现有MLLM基准的不足,在对12个领先统一多模态模型的系统测试中揭示了当前模型存在的显著性能差距。
English: This paper introduces a comprehensive evaluation framework addressing the limitations of existing MLLM benchmarks by standardizing traditional tasks and introducing novel mixed-modality generation assessments, revealing significant performance gaps in current unified multimodal models through systematic testing of 12 leading systems.

Authors:Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, Zhou Zhao
Title: RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Abstract:
Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
中文: 本研究提出RoboGround系统,通过将接地掩码作为中间表示来增强机器人操作策略的泛化能力,结合空间引导和大规模视觉语言模型,实验验证了该方法的有效性。
English: This research introduces RoboGround, a robotic manipulation system that utilizes grounding masks as intermediate representations to enhance policy generalization by providing spatial guidance and leveraging large-scale vision-language models, with experiments confirming its effectiveness.

Authors:Guobiao Li, Lei Tan, Yuliang Xue, Gaozhi Liu, Zhenxing Qian, Sheng Li, Xinpeng Zhang
Title: Adversarial Shallow Watermarking
Abstract:
Recent advances in digital watermarking make use of deep neural networks for message embedding and extraction. They typically follow the ``encoder-noise layer-decoder''-based architecture. By deliberately establishing a differentiable noise layer to simulate the distortion of the watermarked signal, they jointly train the deep encoder and decoder to fit the noise layer to guarantee robustness. As a result, they are usually weak against unknown distortions that are not used in their training pipeline. In this paper, we propose a novel watermarking framework to resist unknown distortions, namely Adversarial Shallow Watermarking (ASW). ASW utilizes only a shallow decoder that is randomly parameterized and designed to be insensitive to distortions for watermarking extraction. During the watermark embedding, ASW freezes the shallow decoder and adversarially optimizes a host image until its updated version (i.e., the watermarked image) stably triggers the shallow decoder to output the watermark message. During the watermark extraction, it accurately recovers the message from the watermarked image by leveraging the insensitive nature of the shallow decoder against arbitrary distortions. Our ASW is training-free, encoder-free, and noise layer-free. Experiments indicate that the watermarked images created by ASW have strong robustness against various unknown distortions. Compared to the existing ``encoder-noise layer-decoder'' approaches, ASW achieves comparable results on known distortions and better robustness on unknown distortions.
中文: 现有基于深度神经网络的水印方法对未知失真鲁棒性差,而提出的对抗性浅层水印(ASW)框架通过固定浅层解码器和对抗性优化,无需训练即可实现强鲁棒性,尤其对未知失真表现优异。
English: Recent deep learning-based watermarking methods are vulnerable to unknown distortions, but the proposed Adversarial Shallow Watermarking (ASW) framework overcomes this limitation by using a fixed shallow decoder and adversarial optimization to achieve strong robustness without requiring training or explicit noise modeling.

Authors:Jiaqi Peng, Tai Wang, Jiangmiao Pang, Yuan Shen
Title: Towards Latency-Aware 3D Streaming Perception for Autonomous Driving
Abstract:
Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical feature regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80\% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.
中文: 现有3D感知算法在边缘设备上因运行时延面临挑战,为此我们提出了延迟感知基准和LASP框架,通过延迟感知的历史特征整合与预测性检测模块,在Jetson AGX Orin上未使用任何加速技术的情况下,在线性能达到离线评估的约80%。
English: Existing 3D perception algorithms face runtime latency challenges on edge devices, prompting a latency-aware benchmark and the LASP framework, which combines latency-aware history integration with predictive detection to reach online performance close to 80% of its offline evaluation on the Jetson AGX Orin without acceleration techniques.

Authors:Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Penghai Zhao, Tianlin Li, Xiaojun Jia, Xiang Li, Xingshuai Li, Yang Liu, Yebo Feng, Yihao Huang, Yijia Xu, Yuqiang Sun, Zhenhong Zhou, Zhengzi Xu
Title: A Vision for Auto Research with LLM Agents
Abstract:
This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.
中文: 本文提出基于智能体的自动研究框架,利用大语言模型通过模块化智能体协作实现科研全流程自动化与优化,涵盖文献综述到成果传播各环节,旨在解决工作流碎片化与方法论认知负荷问题,初步验证了该自进化AI驱动科研范式的可行性与潜力。
English: This paper presents Agent-Based Auto Research, a multi-agent framework using large language models to automate and optimize the entire scientific research lifecycle, from literature review to dissemination, addressing workflow fragmentation and cognitive overload while demonstrating feasibility for AI-driven, self-improving research processes.

Authors:Xin Li, Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Amal Youssef, Yalin Wang
Title: A BERT-Style Self-Supervised Learning CNN for Disease Identification from Retinal Images
Abstract:
In the field of medical imaging, the advent of deep learning, especially the application of convolutional neural networks (CNNs) has revolutionized the analysis and interpretation of medical images. Nevertheless, deep learning methods usually rely on large amounts of labeled data. In medical imaging research, the acquisition of high-quality labels is both expensive and difficult. The introduction of Vision Transformers (ViT) and self-supervised learning provides a pre-training strategy that utilizes abundant unlabeled data, effectively alleviating the label acquisition challenge while broadening the breadth of data utilization. However, ViT's high computational density and substantial demand for computing power, coupled with the lack of localization characteristics of its operations on image patches, limit its efficiency and applicability in many application scenarios. In this study, we employ nn-MobileNet, a lightweight CNN framework, to implement a BERT-style self-supervised learning approach. We pre-train the network on the unlabeled retinal fundus images from the UK Biobank to improve downstream application performance. We validate the results of the pre-trained model on Alzheimer's disease (AD), Parkinson's disease (PD), and various retinal diseases identification. The results show that our approach can significantly improve performance in the downstream tasks. In summary, this study combines the benefits of CNNs with the capabilities of advanced self-supervised learning in handling large-scale unlabeled data, demonstrating the potential of CNNs in the presence of label scarcity.
中文: 本研究采用轻量级CNN框架结合自监督学习,利用未标记视网膜图像进行预训练,显著提升了神经退行性疾病和视网膜疾病的识别性能,有效应对了医学影像中标签稀缺的挑战。
English: This study employs a lightweight CNN framework with self-supervised learning on unlabeled retinal images, significantly enhancing performance in identifying neurological and retinal diseases while addressing label scarcity in medical imaging.

Authors:Huaizhi Qu, Inyoung Choi, Zhen Tan, Song Wang, Sukwon Yun, Qi Long, Faizan Siddiqui, Kwonjoon Lee, Tianlong Chen
Title: Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer
Abstract:
LLM ensembles are widely used for LLM judges. However, how to estimate their accuracy, especially in an efficient way, is unknown. In this paper, we present a principled maximum a posteriori (MAP) framework for an economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, revising from the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that utilizes learned distributions on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimum labeled samples. BetaConform is also validated empirically. For instance, with only 10 samples from the TruthfulQA dataset, for a Llama ensembled judge, BetaConform gauges its performance with error margin as small as 3.37%.
中文: 本文提出BetaConform框架,通过贝塔-二项分布建模、自适应共形停止和先验迁移机制,仅用10个标注样本就能在TruthfulQA数据集上以3.37%的误差精确评估大语言模型集成判断的性能。
English: This paper introduces BetaConform, a maximum a posteriori framework that uses Beta-Binomial modeling, adaptive conformal stopping, and prior transfer to efficiently estimate LLM ensemble judgment accuracy with minimal labeled data, achieving a 3.37% error margin on TruthfulQA with just 10 samples.
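
The MAP idea behind estimating a judge's accuracy from few labeled samples can be illustrated with a single Beta prior and Binomial observations. This is a deliberately reduced sketch, not BetaConform itself: it uses one Beta component rather than the paper's mixture, omits adaptive stopping, and the function name `beta_map_accuracy` and all numbers are assumptions.

```python
def beta_map_accuracy(correct, total, prior_a=1.0, prior_b=1.0):
    """MAP estimate of accuracy under a Beta(prior_a, prior_b) prior.

    Observing `correct` successes in `total` judgments gives the posterior
    Beta(prior_a + correct, prior_b + total - correct), whose mode is
    (a - 1) / (a + b - 2) when a >= 1, b >= 1 and a + b > 2.
    """
    a = prior_a + correct
    b = prior_b + (total - correct)
    if a < 1.0 or b < 1.0 or a + b <= 2.0:
        raise ValueError("posterior mode is not defined by this formula")
    return (a - 1.0) / (a + b - 2.0)

# With a uniform prior the MAP estimate equals the observed accuracy.
uniform_estimate = beta_map_accuracy(correct=7, total=10)

# A prior fitted on other datasets (hypothetical values) pulls the estimate
# toward what was seen there, which matters most when labels are scarce.
transferred_estimate = beta_map_accuracy(correct=7, total=10, prior_a=9.0, prior_b=3.0)
```

The second call shows the intuition of prior transfer: with only 10 labeled samples, an informative prior shifts the estimate from 0.7 to 0.75.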

Authors:Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini
Title: VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Abstract:
While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1\%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster, achieving an 8.2$\times$ performance improvement and 4.1$\times$ higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8$\times$ and 3.6$\times$ reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
中文: 针对Transformer中Softmax的性能瓶颈,我们基于Schraudolph近似算法设计了Bfloat16指数运算定制模块,以微小面积开销集成到RISC-V核中,在无需重新训练的情况下为GPT-2、ViT等模型实现了显著的延迟和能耗降低。
English: To overcome the performance bottleneck of Softmax in Transformers, we developed a custom Bfloat16 exponentiation block using Schraudolph's approximation, integrated it into RISC-V cores with minimal area overhead, achieving significant latency and energy reductions for models like GPT-2 and ViT without retraining.
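
For readers unfamiliar with Schraudolph's method, the classic trick is to build the bit pattern of a floating-point number directly from a scaled input, so that the exponent field carries the integer part of x / ln 2. The sketch below shows the textbook float32 variant without any correction term; it is not the paper's Bfloat16 algorithm or hardware block, and the names `schraudolph_exp` and `approx_softmax` are assumptions.

```python
import math
import struct

SCALE = 12102203   # approximately 2**23 / ln(2)
BIAS = 127 << 23   # float32 exponent bias shifted into the exponent field

def schraudolph_exp(x):
    """Approximate exp(x) by writing an integer into a float32 bit pattern."""
    bits = int(SCALE * x) + BIAS
    if bits <= 0:
        return 0.0            # underflow
    if bits >= 0x7F800000:
        return math.inf       # overflow
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def approx_softmax(values):
    """Softmax using the approximate exponential, with max subtraction for stability."""
    peak = max(values)
    exps = [schraudolph_exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = approx_softmax([1.0, 2.0, 3.0])
```

Without a correction constant this variant overestimates exp(x) by up to roughly 6%; refinements of the method reduce that error, and the abstract reports negligible accuracy loss for the authors' own approximation.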

Authors:Mengyao Wang, Jiayun Wu, Shuai Ma, Nuo Li, Peng Zhang, Ning Gu, Tun Lu
Title: Adaptive Human-Agent Teaming: A Review of Empirical Studies from the Process Dynamics Perspective
Abstract:
The rapid advancement of AI, including Large Language Models, has propelled autonomous agents forward, accelerating the human-agent teaming (HAT) paradigm to leverage complementary strengths. However, HAT research remains fragmented, often focusing on isolated team development phases or specific challenges like trust calibration while overlooking the real-world need for adaptability. Addressing these gaps, a process dynamics perspective is adopted to systematically review HAT using the T$^4$ framework: Team Formation, Task and Role Development, Team Development, and Team Improvement. Each phase is examined in terms of its goals, actions, and evaluation metrics, emphasizing the co-evolution of task and team dynamics. Special focus is given to the second and third phases, highlighting key factors such as team roles, shared mental model, and backup behaviors. This holistic perspective identifies future research directions for advancing long-term adaptive HAT.
中文: 本综述采用过程动态视角,通过T⁴框架系统审视人机协作(HAT),分析团队形成、任务与角色发展、团队发展及团队改进四个阶段,以解决现有研究碎片化问题并推动长期适应性人机协作的发展。
English: This review adopts a process dynamics perspective and uses the T⁴ framework to systematically examine human-agent teaming (HAT) across team formation, task and role development, team development, and team improvement, addressing fragmented prior research and identifying directions for long-term adaptive HAT.

Authors:Peiliang Gong, Emadeldeen Eldele, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang
Title: Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization
Abstract:
Foundation models have achieved remarkable success across diverse machine-learning domains through large-scale pretraining on large, diverse datasets. However, pretraining on such datasets introduces significant challenges due to substantial mismatches in data distributions, a problem particularly pronounced with time series data. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm), where learned prototypes encapsulate distinct data distributions, and sample-to-prototype affinity determines the appropriate normalization layer. This mechanism effectively captures the heterogeneity of time series characteristics, aligning pretrained representations with downstream tasks. Through comprehensive empirical evaluation, we demonstrate that our method significantly outperforms conventional pretraining techniques across both classification and forecasting tasks, while effectively mitigating the adverse effects of distribution shifts during pretraining. Incorporating ProtoNorm is as simple as replacing a single line of code. Extensive experiments on diverse real-world time series benchmarks validate the robustness and generalizability of our approach, advancing the development of more versatile time series foundation models.
中文: 本文提出ProtoNorm这一领域感知的自适应归一化策略,通过替换Transformer中的LayerNorm来解决时序基础模型中数据分布不匹配的问题,显著提升了分类和预测任务的性能,同时有效缓解了预训练中的分布偏移影响。
English: This paper introduces ProtoNorm, a domain-aware adaptive normalization strategy that replaces LayerNorm in Transformers to address data distribution mismatches in time series foundation models, significantly improving performance across classification and forecasting tasks while mitigating distribution shift effects.
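
The abstract says only that learned prototypes capture distinct data distributions and that sample-to-prototype affinity selects the normalization layer. The sketch below illustrates that routing idea under several assumptions that are not from the paper: cosine similarity as the affinity, a hard argmax choice, and the hypothetical names `proto_norm`, `prototypes` and `norm_params`.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over a single feature vector.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb + 1e-12)

def proto_norm(x, prototypes, norm_params):
    # Assumption: affinity = cosine similarity, routing = hard argmax.
    affinities = [cosine(x, p) for p in prototypes]
    k = max(range(len(prototypes)), key=affinities.__getitem__)
    gamma, beta = norm_params[k]  # LayerNorm parameters tied to prototype k
    return k, layer_norm(x, gamma, beta)

# Two toy prototypes, each with its own (gamma, beta).
prototypes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
norm_params = [([1.0] * 4, [0.0] * 4), ([2.0] * 4, [0.5] * 4)]
k, y = proto_norm([0.1, 0.0, 0.9, 1.1], prototypes, norm_params)
```

In a real model the prototypes and per-prototype parameters would be learned jointly with the Transformer; here they are fixed toy values.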

Authors:Zihan Wang, Jinghao Lin, Xiaocui Yang, Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang
Title: Enhancing LLM-based Recommendation through Semantic-Aligned Collaborative Knowledge
Abstract:
Large Language Models (LLMs) demonstrate remarkable capabilities in leveraging comprehensive world knowledge and sophisticated reasoning mechanisms for recommendation tasks. However, a notable limitation lies in their inability to effectively model sparse identifiers (e.g., user and item IDs), unlike conventional collaborative filtering models (Collabs.), which hinders LLMs from learning distinctive user-item representations and creates a performance bottleneck. Prior studies indicate that integrating collaborative knowledge from Collabs. into LLMs can mitigate the above limitations and enhance their recommendation performance. Nevertheless, the significant discrepancy in knowledge distribution and semantic space between LLMs and Collabs. presents substantial challenges for effective knowledge transfer. To tackle these challenges, we propose a novel framework, SeLLa-Rec, which focuses on achieving alignment between the semantic spaces of Collabs. and LLMs. This alignment fosters effective knowledge fusion, mitigating the influence of discriminative noise and facilitating the deep integration of knowledge from diverse models. Specifically, three special tokens with collaborative knowledge are embedded into the LLM's semantic space through a hybrid projection layer and integrated into task-specific prompts to guide the recommendation process. Experiments conducted on two public benchmark datasets (MovieLens-1M and Amazon Book) demonstrate that SeLLa-Rec achieves state-of-the-art performance.
中文: 大语言模型在推荐任务中表现出色,但难以处理稀疏标识符,为此提出的SeLLa-Rec框架通过对齐大语言模型与协同过滤模型的语义空间,实现有效知识融合,从而获得最优性能。
English: Large Language Models (LLMs) excel in recommendation tasks but struggle with sparse identifiers, which the proposed SeLLa-Rec framework addresses by aligning semantic spaces between LLMs and collaborative filtering models to enable effective knowledge fusion and achieve state-of-the-art performance.

Authors:Pucheng Dang, Di Huang, Dong Li, Kang Chen, Yuanbo Wen, Qi Guo, Xing Hu
Title: MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions
Abstract:
Out-of-tree kernel patches are essential for adapting the Linux kernel to new hardware or enabling specific functionalities. Maintaining and updating these patches across different kernel versions demands significant effort from experienced engineers. Large language models (LLMs) have shown remarkable progress across various domains, suggesting their potential for automating out-of-tree kernel patch migration. However, our findings reveal that LLMs, while promising, struggle with incomplete code context understanding and inaccurate migration point identification. In this work, we propose MigGPT, a framework that employs a novel code fingerprint structure to retain code snippet information and incorporates three meticulously designed modules to improve the migration accuracy and efficiency of out-of-tree kernel patches. Furthermore, we establish a robust benchmark using real-world out-of-tree kernel patch projects to evaluate LLM capabilities. Evaluations show that MigGPT significantly outperforms the direct application of vanilla LLMs, achieving an average completion rate of 74.07% for migration tasks.
Chinese: MigGPT提出了一种创新框架,通过代码指纹结构和三个精心设计的模块显著提升了Linux内核树外补丁迁移的准确性与效率,平均完成率达74.07%,大幅优于直接使用大型语言模型的表现。
English: MigGPT introduces a novel framework utilizing code fingerprints and specialized modules to significantly enhance the accuracy and efficiency of migrating out-of-tree Linux kernel patches, achieving a 74.07% average completion rate and outperforming standard large language models.

Authors:Haokai Ma, Yunshan Ma, Ruobing Xie, Lei Meng, Jialie Shen, Xingwu Sun, Zhanhui Kang, Tat-Seng Chua
Title: Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training
Abstract:
Recent research efforts have investigated how to integrate Large Language Models (LLMs) into recommendation, capitalizing on their semantic comprehension and open-world knowledge for user behavior understanding. These approaches predominantly employ supervised fine-tuning on single-domain user interactions to adapt LLMs for specific recommendation tasks. However, they typically encounter dual challenges: the mismatch between general language representations and domain-specific preference patterns, as well as the limited adaptability to multi-domain recommendation scenarios. To bridge these gaps, we introduce CPRec -- an All-domain Continual Pre-Training framework for Recommendation -- designed to holistically align LLMs with universal user behaviors through the continual pre-training paradigm. Specifically, we first design a unified prompt template and organize users' multi-domain behaviors into domain-specific behavioral sequences and all-domain mixed behavioral sequences that emulate real-world user decision logic. To optimize behavioral knowledge infusion, we devise a Warmup-Stable-Annealing learning rate schedule tailored for the continual pre-training paradigm in recommendation to progressively enhance the LLM's capability in knowledge adaptation from open-world knowledge to universal recommendation tasks. To evaluate the effectiveness of our CPRec, we implement it on a large-scale dataset covering seven domains and conduct extensive experiments on five real-world datasets from two distinct platforms. Experimental results confirm that our continual pre-training paradigm significantly mitigates the semantic-behavioral discrepancy and achieves state-of-the-art performance in all recommendation scenarios. The source code will be released upon acceptance.
中文摘要:CPRec框架通过跨领域的持续预训练,有效弥合了通用语言模型与推荐任务之间的语义差异,利用统一的行为序列建模和多阶段学习策略,在各类推荐场景中实现了最优性能。
English Summary: The CPRec framework addresses limitations in adapting Large Language Models for recommendation systems by employing continual pre-training across multiple domains, effectively aligning general language knowledge with user behavior patterns to achieve superior performance.
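
The abstract names a Warmup-Stable-Annealing learning rate schedule but gives no formula. The sketch below shows one plausible three-phase shape. Linear warmup, cosine annealing, the phase fractions, the learning-rate values and the function name `wsa_lr` are all assumptions for illustration, not CPRec's published settings.

```python
import math

def wsa_lr(step, total_steps, peak_lr=1e-4, min_lr=1e-5,
           warmup_frac=0.05, anneal_frac=0.2):
    """Learning rate at `step` for a warmup / stable / annealing schedule."""
    warmup_steps = int(total_steps * warmup_frac)
    anneal_start = total_steps - int(total_steps * anneal_frac)
    if step < warmup_steps:
        # Phase 1: linear ramp up to the peak rate (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    if step < anneal_start:
        # Phase 2: hold the peak rate constant.
        return peak_lr
    # Phase 3: cosine decay from peak_lr to min_lr (assumed shape).
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a 1000-step run.
schedule = [wsa_lr(s, 1000) for s in (0, 49, 500, 900, 1000)]
```

The long constant phase is what separates this family from a plain warmup-then-decay schedule: most updates happen at the peak rate, and annealing is confined to the final portion of training.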

Authors:Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu
Title: OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM
Abstract:
Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a common way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that the LLaVA model with our OrderChain significantly improves on the baseline LLaVA across diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% in accuracy and 0.24 in MAE on the Adience dataset. To the best of our knowledge, OrderChain is the first work that augments MLLMs for OR tasks, and its effectiveness is demonstrated across a spectrum of OR datasets. Project Page: https://order-chain.github.io/.
中文: OrderChain是一种新颖的提示范式,通过特性与共性建模增强多模态大语言模型的序数理解能力,在多种序数回归任务上显著提升了性能表现。
English: OrderChain is a novel prompting paradigm that enhances multimodal large language models' ordinal understanding through specificity and commonality modeling, significantly improving performance across various ordinal regression tasks.

Authors:Victor J. B. Jung, Alessio Burrello, Francesco Conti, Luca Benini
Title: Fused-Tiled Layers: Minimizing Data Movement on RISC-V SoCs with Software-Managed Caches
Abstract:
The success of DNNs and their high computational requirements have driven large co-design efforts aiming at DNN acceleration. Since DNNs can be represented as static computational graphs, static memory allocation and tiling are two crucial optimizations. Hence, SoCs specialized for DNN acceleration commonly feature a multi-level software-managed memory hierarchy. In such architectures, layer-wise tiling, i.e., splitting each layer into multiple sub-nodes, is commonly used; however, while reducing memory occupation, it can increase the total memory transfer, ultimately causing costly off-chip memory copies, which impact energy efficiency and create memory bottlenecks. This work proposes Fused-Tiled Layers (FTL), a novel algorithm for automatic fusion between tiled layers. We leverage the flexibility and efficiency of a RISC-V (RV32) heterogeneous SoC to integrate FTL in an open-source deployment framework, which we tune for RISC-V targets. We demonstrate that FTL brings up to 60.1% runtime reduction for a typical MLP stage of ViT due to the reduction of off-chip transfer and on-chip data movement by 47.1%.
Chinese: 本研究提出融合平铺层(FTL)算法,通过在RISC-V SoC上优化平铺层融合,实现运行时间最高减少60.1%,片外数据传输降低47.1%。
English: This work introduces Fused-Tiled Layers (FTL), an algorithm that automatically fuses tiled layers on a RISC-V SoC, reducing runtime by up to 60.1% for a typical ViT MLP stage by cutting off-chip transfers and on-chip data movement by 47.1%.
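
The contrast between layer-wise tiling and fused tiling can be shown on a toy two-layer MLP. This is a minimal sketch of the general idea only, not the FTL algorithm or its deployment framework. The function names are hypothetical, ReLU stands in for the activation, and "peak intermediate elements" is a crude proxy for the buffer that would otherwise spill off-chip.

```python
def matmul(a, b):
    # Plain row-by-column matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def mlp_layerwise(x, w1, w2):
    # Layer by layer: the whole hidden activation exists before layer 2 starts.
    hidden = relu(matmul(x, w1))
    return matmul(hidden, w2), len(hidden) * len(hidden[0])

def mlp_fused_tiled(x, w1, w2, tile_rows):
    # Fused tiles: each row tile runs through both layers back to back,
    # so only one tile of the hidden activation is alive at a time.
    out, peak = [], 0
    for start in range(0, len(x), tile_rows):
        hidden_tile = relu(matmul(x[start:start + tile_rows], w1))
        peak = max(peak, len(hidden_tile) * len(hidden_tile[0]))
        out.extend(matmul(hidden_tile, w2))
    return out, peak

x = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0], [2.0, 0.0]]
w1 = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full, full_peak = mlp_layerwise(x, w1, w2)
fused, fused_peak = mlp_fused_tiled(x, w1, w2, tile_rows=2)
```

Both paths compute the same output; the fused path simply never holds the full hidden activation, and the fused-tiling approach aims to exploit this property when the intermediate would not fit in on-chip memory.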

Authors:Zhuoran Yang, Jie Peng, Zhen Tan, Tianlong Chen, Yanyong Zhang
Title: LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution
Abstract:
Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for defending against jailbreak attacks are primarily based on auxiliary models. These strategies, however, often require extensive data collection or training. We propose LightDefense, a lightweight defense mechanism targeted at white-box models, which utilizes a safety-oriented direction to adjust the probabilities of tokens in the vocabulary, making safety disclaimers appear among the top tokens after sorting tokens by probability in descending order. We further innovatively leverage LLM's uncertainty about prompts to measure their harmfulness and adaptively adjust defense strength, effectively balancing safety and helpfulness. The effectiveness of LightDefense in defending against 5 attack methods across 2 target LLMs, without compromising helpfulness to benign user queries, highlights its potential as a novel and lightweight defense mechanism, enhancing security of LLMs.
中文摘要:LightDefense是一种轻量级防御机制,通过调整词汇概率分布并利用模型不确定性自适应防御强度,在保护白盒大语言模型免受越狱攻击的同时保持对正常查询的响应能力。
English Summary: LightDefense is a lightweight security mechanism that protects white-box LLMs by adjusting token probabilities and adaptively using model uncertainty to balance safety and helpfulness against jailbreak attacks.
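
The two ingredients named in the abstract, a safety-oriented shift of the token distribution and a defense strength that scales with model uncertainty, can be sketched as follows. This is an illustration under assumptions, not the LightDefense implementation: normalized entropy as the uncertainty measure, an additive shift on logits, the toy `safety_direction` vector and the `max_strength` parameter are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    # Entropy divided by its maximum, so the result lies in [0, 1].
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def shift_distribution(logits, safety_direction, max_strength=5.0):
    # Assumption: higher next-token uncertainty -> stronger defensive shift.
    strength = max_strength * normalized_entropy(softmax(logits))
    shifted = [l + strength * d for l, d in zip(logits, safety_direction)]
    return softmax(shifted), strength

# Toy vocabulary: ["Sure", "I", "cannot", "Sorry"]
safety_direction = [-1.0, 0.0, 1.0, 1.0]
confident, s_low = shift_distribution([10.0, 0.0, 0.0, 0.0], safety_direction)
uncertain, s_high = shift_distribution([1.0, 0.9, 0.8, 0.7], safety_direction)
```

With a confident distribution the shift is close to zero and the original top token survives; with a near-uniform distribution the refusal-style tokens move to the top.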

Authors:Emadeldeen Eldele, Mohamed Ragab, Xu Qing, Edward, Zhenghua Chen, Min Wu, Xiaoli Li, Jay Lee
Title: UniFault: A Fault Diagnosis Foundation Model from Bearing Data
Abstract:
Machine fault diagnosis (FD) is a critical task for predictive maintenance, enabling early fault detection and preventing unexpected failures. Despite its importance, existing FD models are operation-specific with limited generalization across diverse datasets. Foundation models (FM) have demonstrated remarkable potential in both visual and language domains, achieving impressive generalization capabilities even with minimal data through few-shot or zero-shot learning. However, translating these advances to FD presents unique hurdles. Unlike the large-scale, cohesive datasets available for images and text, FD datasets are typically smaller and more heterogeneous, with significant variations in sampling frequencies and the number of channels across different systems and applications. This heterogeneity complicates the design of a universal architecture capable of effectively processing such diverse data while maintaining robust feature extraction and learning capabilities. In this paper, we introduce UniFault, a foundation model for fault diagnosis that systematically addresses these issues. Specifically, the model incorporates a comprehensive data harmonization pipeline featuring two key innovations. First, a unification scheme transforms multivariate inputs into standardized univariate sequences while retaining local inter-channel relationships. Second, a novel cross-domain temporal fusion strategy mitigates distribution shifts and enriches sample diversity and count, improving the model generalization across varying conditions. UniFault is pretrained on over 9 billion data points spanning diverse FD datasets, enabling superior few-shot performance. Extensive experiments on real-world FD datasets demonstrate that UniFault achieves SoTA performance, setting a new benchmark for fault diagnosis models and paving the way for more scalable and robust predictive maintenance solutions.
中文: UniFault是一种用于机器故障诊断的基础模型,通过数据协调和跨域时序融合解决了数据集异质性问题,在不同条件下实现了最先进的少样本性能。
English: UniFault is a foundation model for machine fault diagnosis that overcomes dataset heterogeneity through data harmonization and cross-domain temporal fusion, achieving state-of-the-art few-shot performance across diverse conditions.

Authors:Yujian Xiong, Xuanzhao Dong, Sebastian Waz, Wenhui Zhu, Negar Mallak, Zhong-lin Lu, Yalin Wang
Title: Schrödinger Diffusion Driven Signal Recovery in 3T BOLD fMRI Using Unmatched 7T Observations
Abstract:
Ultra-high-field (7 Tesla) BOLD fMRI offers exceptional detail in both spatial and temporal domains, along with robust signal-to-noise characteristics, making it a powerful modality for studying visual information processing in the brain. However, due to the limited accessibility of 7T scanners, the majority of neuroimaging studies are still conducted using 3T systems, which inherently suffer from reduced fidelity in both resolution and SNR. To mitigate this limitation, we introduce a new computational approach designed to enhance the quality of 3T BOLD fMRI acquisitions. Specifically, we project both 3T and 7T datasets, sourced from different individuals and experimental setups, into a shared low-dimensional representation space. Within this space, we employ a lightweight, unsupervised Schrödinger Bridge framework to infer a high-SNR, high-resolution counterpart of the 3T data, without relying on paired supervision. This methodology is evaluated across multiple fMRI retinotopy datasets, including synthetically generated samples, and demonstrates a marked improvement in the reliability and fit of population receptive field (pRF) models applied to the enhanced 3T outputs. Our findings suggest that it is feasible to computationally approximate 7T-level quality from standard 3T acquisitions.
中文: 一种新型无监督计算方法通过薛定谔桥框架将3T fMRI数据与7T数据投影至共享表征空间,无需配对监督即可显著提升3T数据质量,使其达到接近7T的分辨率并改善群体感受野模型的拟合效果。
English: A new unsupervised computational method using a Schrödinger Bridge framework enhances 3T fMRI data quality by projecting it into a shared space with 7T data, achieving near-7T resolution and improving pRF model reliability without paired supervision.

Authors:Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Yun, Charles Fleming, Tianlong Chen
Title: $\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Abstract:
Most discussions about Large Language Model (LLM) safety have focused on single-agent settings, but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a $\textit{permutation-invariant adversarial attack}$ that optimizes prompt distribution across latency- and bandwidth-constrained network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of $\textit{maximum-flow minimum-cost}$, coupled with the novel $\textit{Permutation-Invariant Evasion Loss (PIEL)}$, we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including $\texttt{Llama}$, $\texttt{Mistral}$, $\texttt{Gemma}$, $\texttt{DeepSeek}$ and other variants on various datasets like $\texttt{JailBreakBench}$ and $\texttt{AdversarialBench}$, our method outperforms conventional attacks by up to $7\times$, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of $\texttt{Llama-Guard}$ and $\texttt{PromptGuard}$, fail to prevent our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.
中文: 本研究提出了一种针对多智能体大语言模型的排列不变对抗攻击,该攻击通过优化受限网络中的提示分布来绕过分布式安全机制,在显著超越传统方法的同时成功规避现有防御措施。
English: This study introduces a permutation-invariant adversarial attack targeting multi-agent LLM systems, which bypasses distributed safety mechanisms by optimizing prompt distribution across constrained networks and significantly outperforms traditional methods while evading existing defenses.

Authors:Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas, Dimitrios K. Nasiopoulos
Title: Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks
Abstract:
Automation in agriculture plays a vital role in addressing challenges related to crop monitoring and disease management, particularly through early detection systems. This study investigates the effectiveness of combining multimodal Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. Leveraging the PlantVillage dataset, we systematically evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios. A comparative analysis between GPT-4o and the widely used ResNet-50 model was conducted across three resolutions (100, 150, and 256 pixels) and two plant species (apple and corn). Results indicate that fine-tuned GPT-4o models slightly outperformed ResNet-50, reaching up to 98.12% classification accuracy on apple leaf images versus 96.88% for ResNet-50, with improved generalization and near-zero training loss. However, zero-shot performance of GPT-4o was significantly lower, underscoring the need for at least minimal training. Additional evaluations on cross-resolution and cross-plant generalization revealed the models' adaptability and limitations when applied to new domains. The findings highlight the promise of integrating multimodal LLMs into automated disease detection pipelines, enhancing the scalability and intelligence of precision agriculture systems while reducing the dependence on large, labeled datasets and high-resolution sensor infrastructure.
Keywords: Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection with Vision Language Models, VLMs
中文: 研究表明,经过微调的多模态大语言模型GPT-4o在苹果叶片图像上的病害分类准确率(98.12%)略高于ResNet-50(96.88%),但其零样本性能明显偏低,说明仍需少量训练。
English: This study shows that fine-tuned GPT-4o slightly outperforms ResNet-50 on plant disease classification (up to 98.12% vs. 96.88% accuracy on apple leaf images), whereas its zero-shot performance is significantly lower, so some training remains necessary.

Authors:Juntian Zhang, Chuanqi Cheng, Yuhan Liu, Wei Liu, Jian Luan, Rui Yan
Title: Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
Abstract:
Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising the general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
Chinese: 本研究提出焦点中心视觉链范式及其数据合成方法,旨在提升视觉语言模型在多图像场景下的性能,在七个基准测试中取得显著进步且不影响通用能力。
English: This study introduces the Focus-Centric Visual Chain paradigm and its corresponding data synthesis method to enhance vision-language models' performance in multi-image scenarios, achieving significant improvements across seven benchmarks without affecting general capabilities.

Authors:Ranjan Sapkota, Konstantinos I Roumeliotis, Rahul Harsha Cheppally, Marco Flores Calero, Manoj Karkee
Title: A Review of 3D Object Detection with Vision-Language Models
Abstract:
This review provides a systematic analysis and comprehensive survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to 3D object detection with vision-language models. We begin by outlining the unique challenges of 3D object detection with vision-language models, emphasizing differences from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective 3D object detection with vision-language models. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance 3D object detection with vision-language models.
Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI
中文: 本综述系统分析了基于视觉语言模型的3D物体检测,对比传统点云方法与CLIP等现代框架,并探讨当前数据与算力瓶颈及未来研究方向。
English: This review systematically analyzes 3D object detection using vision-language models, comparing traditional methods with modern frameworks like CLIP and 3D LLMs while addressing challenges and future directions.

Authors:Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang
Title: ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance
Abstract:
While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on an action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as an action tree and assign embeddings to tree nodes, so that each instruction can acquire its embeddings by navigating through the action tree. The instruction embeddings can then be used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts the instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics in both seen and unseen tasks, with PSNR improved from 19.55 to 21.05, SSIM improved from 0.7474 to 0.7982, and Flow Error reduced from 3.506 to 3.201 in unseen tasks, compared to the recent RoboDreamer model. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% on average across 6 RLBench tasks.
中文:ManipDreamer通过引入动作树优化指令遵循,并结合视觉指导提升视频质量,在机器人操作任务中显著提高了视频指标和成功率。
English: ManipDreamer enhances robotic manipulation video synthesis by introducing an action tree for better instruction-following and incorporating visual guidance to improve video quality, achieving significant gains in metrics and task success rates.

Authors:Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Jiaming Ji, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Yaodong Yang, Muhan Zhang, Hua Xing Zhu
Title: PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Abstract:
Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
中文摘要:PHYBench作为新型物理评测基准,通过500道原创题目和系统化筛选流程解决现有评估缺陷,并引入表达式编辑距离提升数学评估精度,实验显示最佳模型性能(36.9%)远低于人类专家水平(61.9%)。
English Summary: PHYBench is a novel physics benchmark addressing limitations in current LLM evaluations by featuring 500 original problems with rigorous curation and introducing the EED Score for precise mathematical assessment, revealing significant performance gaps between top models and human experts.

Authors:Haohe Liu, Thomas Deacon, Wenwu Wang, Matt Paradis, Mark D. Plumbley
Title: Exploring the User Experience of AI-Assisted Sound Searching Systems for Creative Workflows
Abstract:
Locating the right sound effect efficiently is an important yet challenging topic for audio production. Most current sound-searching systems rely on pre-annotated audio labels created by humans, which can be time-consuming to produce and prone to inaccuracies, limiting the efficiency of audio production. Following the recent advancement of contrastive language-audio pre-training (CLAP) models, we explore an alternative CLAP-based sound-searching system (CLAP-UI) that does not rely on human annotations. To evaluate the effectiveness of CLAP-UI, we conducted comparative experiments with a widely used sound effect searching platform, the BBC Sound Effect Library. Our study evaluates user performance, cognitive load, and satisfaction through ecologically valid tasks based on professional sound-searching workflows. Our results show that CLAP-UI significantly enhanced productivity and reduced frustration while maintaining comparable cognitive demands. We also qualitatively analyzed the participants' feedback, which offered valuable perspectives on the design of future AI-assisted sound search systems.
中文摘要:本研究开发了基于对比语言-音频预训练模型的CLAP-UI声音检索系统,无需人工标注即可实现高效声音搜索,相比传统平台显著提升工作效率并降低用户挫败感,同时保持相当的认知负荷水平。
English Summary: This study introduces CLAP-UI, a sound-searching system using contrastive language-audio pre-training that eliminates human annotations, demonstrating significantly improved productivity and reduced user frustration compared to traditional platforms while maintaining similar cognitive demands.

Authors:Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Jasper C. H. Lee, Thanasis Pittas
Title: On Learning Parallel Pancakes with Mostly Uniform Weights
Abstract:
We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb{R}^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.
中文: 本研究通过匹配的统计查询下界和上界分析,证明了学习具有均匀权重的k-高斯混合模型需要拟多项式时间复杂度。
English: This study demonstrates that learning k-mixtures of Gaussians with uniform weights requires quasi-polynomial time complexity, as established through a matching Statistical Query lower bound and upper bound analysis.

Authors:Ilias Diakonikolas, Daniel M. Kane, Lisheng Ren
Title: Faster Algorithms for Agnostically Learning Disjunctions and their Implications
Abstract:
We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.
中文: 本研究提出了一种改进的布尔析取式不可知学习算法,其复杂度为 \(2^{\tilde{O}(n^{1/3})}\),首次在分布无关的不可知学习中实现了SQ与CSQ模型的分离。
English: This work presents an improved agnostic learner for Boolean disjunctions with complexity \(2^{\tilde{O}(n^{1/3})}\), achieving the first separation between SQ and CSQ models in distribution-free agnostic learning.

Authors:Meng Cui, Xianghu Yue, Xinyuan Qian, Jinzheng Zhao, Haohe Liu, Xubo Liu, Daoliang Li, Wenwu Wang
Title: Audio-Visual Class-Incremental Learning for Fish Feeding Intensity Assessment in Aquaculture
Abstract:
Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce AV-CIL-FFIA, a new dataset comprising 81,932 labelled audio-visual clips capturing feeding intensities across six different fish species in real aquaculture environments. Then, we pioneer audio-visual class incremental learning (CIL) for FFIA and demonstrate through benchmarking on AV-CIL-FFIA that it significantly outperforms single-modality methods. Existing CIL methods rely heavily on historical data. Exemplar-based approaches store raw samples, creating storage challenges, while exemplar-free methods avoid data storage but struggle to distinguish subtle feeding intensity variations across different fish species. To overcome these limitations, we introduce HAIL-FFIA, a novel audio-visual class-incremental learning framework that bridges this gap with a prototype-based approach that achieves exemplar-free efficiency while preserving essential knowledge through compact feature representations. Specifically, HAIL-FFIA employs hierarchical representation learning with a dual-path knowledge preservation mechanism that separates general intensity knowledge from fish-specific characteristics. Additionally, it features a dynamic modality balancing system that adaptively adjusts the importance of audio versus visual information based on feeding behaviour stages. Experimental results show that HAIL-FFIA is superior to SOTA methods on AV-CIL-FFIA, achieving higher accuracy with lower storage needs while effectively mitigating catastrophic forgetting in incremental fish species learning.
中文: 本研究提出了用于鱼类摄食强度评估的新型视听数据集AV-CIL-FFIA,并开发了HAIL-FFIA分层类增量学习框架,该框架通过自适应平衡模态和无需存储原始数据的知识保留机制,显著优于现有方法。
English: This study introduces AV-CIL-FFIA, a novel audio-visual dataset for fish feeding intensity assessment, and proposes HAIL-FFIA, a hierarchical class-incremental learning framework that outperforms existing methods by adaptively balancing modalities and preserving knowledge without storing raw data.

Authors:David C Wong, Bin Wang, Gorkem Durak, Marouane Tliba, Mohamed Amine Kerkouri, Aladine Chetouani, Ahmet Enis Cetin, Cagdas Topel, Nicolo Gennaro, Camila Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H Miller, Amir A Borhani, Hatice Savas, Eric M. Hart, Elizabeth A Krupinski, Ulas Bagci
Title: Shifts in Doctors' Eye Movements Between Real and AI-Generated Medical Images
Abstract:
Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. In this work, we first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, saccade amplitude, and their joint distribution. These metrics help uncover patterns in attention allocation and diagnostic strategies. Furthermore, we investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images. To achieve this, we examine fixation bias maps, focusing on first, last, shortest, and longest fixations independently, along with detailed saccade patterns, to quantify differences in gaze distribution and visual saliency between authentic and synthetic images.
中文: 眼动追踪分析通过检测眼球运动模式揭示放射科医师的诊断策略,并展示他们在观察真实与人工智能生成医学图像时注视行为的差异。
English: Eye-tracking analysis reveals radiologists' diagnostic strategies by examining eye-movement patterns and demonstrates how their gaze behavior differs when viewing real versus AI-generated medical images.

Authors:Xingyu Lu, Tianke Zhang, Chang Meng, Xiaobei Wang, Jinpeng Wang, YiFan Zhang, Shisong Tang, Changyi Liu, Haojie Ding, Kaiyu Jiang, Kaiyu Tang, Bin Wen, Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, Kun Gai
Title: VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Abstract:
Exponentially growing short video platforms (SVPs) face significant challenges in moderating content detrimental to users' mental health, particularly for minors. The dissemination of such content on SVPs can lead to catastrophic societal consequences. Although substantial efforts have been dedicated to moderating such content, existing methods suffer from critical limitations: (1) Manual review is prone to human bias and incurs high operational costs. (2) Automated methods, though efficient, lack nuanced content understanding, resulting in lower accuracy. (3) Industrial moderation regulations struggle to adapt to rapidly evolving trends due to long update cycles. In this paper, we annotate the first SVP content moderation benchmark with authentic user/reviewer feedback to address the absence of a benchmark in this field. Then we evaluate various methods on the benchmark to verify the existence of the aforementioned limitations. We further propose our common-law content moderation framework named KuaiMod to address these challenges. KuaiMod consists of three components: training data construction, offline adaptation, and online deployment & refinement. Leveraging a large vision-language model (VLM) and Chain-of-Thought (CoT) reasoning, KuaiMod adequately models video toxicity based on sparse user feedback and fosters a dynamic moderation policy with rapid update speed and high accuracy. Offline experiments and a large-scale online A/B test demonstrate the superiority of KuaiMod: KuaiMod achieves the best moderation performance on our benchmark. The deployment of KuaiMod reduces the user reporting rate by 20%, and its application in video recommendation increases both Daily Active User (DAU) and APP Usage Time (AUT) on several Kuaishou scenarios. We have open-sourced our benchmark at https://kuaimod.github.io.
English Summary: Short video platforms struggle with content moderation due to manual review biases, automated method inaccuracies, and slow regulatory updates, prompting the development of KuaiMod, a framework using vision-language models and Chain-of-Thought reasoning to enhance moderation efficiency and accuracy.

Authors:Hangyu Liu, Bo Peng, Pengxiang Ding, Donglin Wang
Title: Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach
Abstract:
Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. Existing generative approaches for multi-target attacks mainly analyze the effect of the use of target labels on noise generation from a theoretical perspective, lacking practical validation and comprehensive summarization. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (2D-TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments on the standard ImageNet dataset demonstrate that 2D-TGAF consistently surpasses state-of-the-art methods in attack success rates, both on normally trained models and across various defense mechanisms.
Chinese Summary: 针对多目标对抗攻击,本研究提出二维张量引导对抗融合框架,利用扩散模型将目标标签编码为语义张量并采用新型掩码策略,在标准及防御模型上的攻击成功率均超越现有最优方法。
English Summary: Multi-target adversarial attacks are enhanced by the proposed 2D Tensor-Guided Adversarial Fusion framework, which leverages diffusion models to encode target labels into semantic tensors and employs a novel masking strategy, achieving superior attack success rates on both standard and defended models compared to existing methods.

Authors:Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
Title: X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Abstract:
Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
Chinese: X-Teaming 是一个可扩展的框架,能有效生成多轮攻击场景以揭示语言模型的安全漏洞,实现了高攻击成功率,并提供了大规模安全训练数据集以增强多轮安全对齐能力。
English: X-Teaming is a scalable framework that effectively generates multi-turn attack scenarios to expose safety vulnerabilities in language models, achieving high success rates and providing a large safety training dataset to enhance multi-turn safety alignment.

Authors:Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
Title: RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
Abstract:
This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
中文: 本研究显示RF-DETR在复杂果园环境中检测绿色果实具有更优的精度和遮挡处理能力,而YOLOv12则以计算效率见长,适合边缘部署场景。
English: This study demonstrates that RF-DETR excels in detecting greenfruits in complex orchard environments with superior accuracy and occlusion handling, while YOLOv12 offers computational efficiency suitable for edge deployment.
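The mAP@50 and mAP@50:95 figures quoted in the abstract above depend on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a minimal sketch of that overlap test only, not the paper's evaluation code. It assumes boxes given as `(x1, y1, x2, y2)` corner coordinates and the common convention that a prediction counts as a match at mAP@50 when IoU is at least 0.5; the function names `iou` and `is_match` are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the ground truth enough to count as a hit."""
    return iou(pred, truth) >= threshold


truth = (0.0, 0.0, 2.0, 2.0)
loose = (1.0, 0.0, 3.0, 2.0)   # shifted sideways: IoU = 2 / 6
tight = (0.0, 0.0, 2.0, 1.0)   # lower half of the truth box: IoU = 2 / 4
print(is_match(loose, truth), is_match(tight, truth))
```

mAP@50:95 is conventionally computed by averaging precision over IoU thresholds from 0.5 to 0.95, which is why a model can lead on one metric and trail on the other.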

Authors:Weijie Shi, Chengyi Ju, Chengzhong Liu, Jiaming Ji, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu, Yaodong Yang, Sirui Han, Yike Guo
Title: Benchmarking Multi-National Value Alignment for Large Language Models
Abstract:
Do Large Language Models (LLMs) hold positions that conflict with your country's values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights into assisting in the identification of misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs' values with the target country.
中文摘要:NaVAB基准通过构建五国价值观评估体系,采用价值提取流程与冲突消减机制,解决了现有方法难以衡量大语言模型与国家价值观一致性的局限,有效提升了模型与目标国家价值观的契合度。
English Summary: The NaVAB benchmark addresses limitations in evaluating Large Language Models' alignment with diverse national values by introducing a scalable assessment framework for five major countries, combining value extraction with conflict reduction mechanisms to improve model alignment.

Authors:Hongkang Li, Yihua Zhang, Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen
Title: When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers
Abstract:
Task arithmetic refers to editing the pre-trained model by adding a weighted sum of task vectors, each of which is the weight update from the pre-trained model to fine-tuned models for certain tasks. This approach recently gained attention as a computationally efficient inference method for model editing, e.g., multi-task learning, forgetting, and out-of-domain generalization capabilities. However, the theoretical understanding of why task vectors can execute various conceptual operations remains limited, due to the high non-convexity of training Transformer-based models. To the best of our knowledge, this paper provides the first theoretical characterization of the generalization guarantees of task vector methods on nonlinear Transformers. We consider a conceptual learning setting, where each task is a binary classification problem based on a discriminative pattern. We theoretically prove the effectiveness of task addition in simultaneously learning a set of irrelevant or aligned tasks, as well as the success of task negation in unlearning one task from irrelevant or contradictory tasks. Moreover, we prove the proper selection of linear coefficients for task arithmetic to achieve guaranteed generalization to out-of-domain tasks. All of our theoretical results hold for both dense-weight parameters and their low-rank approximations. Although established in a conceptual setting, our theoretical findings were validated on a practical machine unlearning task using the large language model Phi-1.5 (1.3B).
中文: 本文首次对非线性Transformer的任务向量算法进行了理论分析,证明了其在多任务学习、遗忘任务及跨领域泛化中的有效性,并在实际机器遗忘任务中得到了验证。
English: This paper provides the first theoretical analysis of task arithmetic's generalization guarantees for nonlinear Transformers, proving its effectiveness in multi-task learning, unlearning, and out-of-domain generalization, with validation on practical machine unlearning tasks.
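The abstract above defines task arithmetic as adding a weighted sum of task vectors to the pre-trained weights, where each task vector is the difference between fine-tuned and pre-trained weights. A minimal sketch of that definition follows, using plain Python lists in place of real model tensors. The names `task_vector` and `apply_task_arithmetic`, the toy weights, and the coefficient values are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values (toy stand-in for tensors)


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector: fine-tuned weights minus pre-trained weights, per parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_task_arithmetic(
    pretrained: Weights, vectors: List[Weights], coeffs: List[float]
) -> Weights:
    """Edited model: pre-trained weights plus a weighted sum of task vectors."""
    edited = {name: list(values) for name, values in pretrained.items()}
    for vec, lam in zip(vectors, coeffs):
        for name, delta in vec.items():
            edited[name] = [w + lam * d for w, d in zip(edited[name], delta)]
    return edited


pre = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}   # hypothetical model fine-tuned on task A
ft_b = {"w": [1.0, 3.0]}   # hypothetical model fine-tuned on task B

tau_a = task_vector(pre, ft_a)
tau_b = task_vector(pre, ft_b)

# Task addition: positive coefficients combine both tasks.
added = apply_task_arithmetic(pre, [tau_a, tau_b], [1.0, 1.0])
# Task negation: a negative coefficient moves away from task A (unlearning).
negated = apply_task_arithmetic(pre, [tau_a], [-1.0])
print(added, negated)
```

The coefficients here are fixed by hand; the paper's contribution concerns when such coefficients provably yield good generalization, which this sketch does not attempt to reproduce.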

Authors:Serge Lionel Nikiema, Jordan Samhi, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé
Title: The Code Barrier: What LLMs Actually Understand?
Abstract:
Understanding code represents a core ability needed for automating software development tasks. While foundation models like LLMs show impressive results across many software engineering challenges, the extent of their true semantic understanding beyond simple token recognition remains unclear. This research uses code obfuscation as a structured testing framework to evaluate LLMs' semantic understanding capabilities. We methodically apply controlled obfuscation changes to source code and measure comprehension through two complementary tasks: generating accurate descriptions of obfuscated code and performing deobfuscation, a skill with important implications for reverse engineering applications. Our testing approach includes 13 cutting-edge models, covering both code-specialized (e.g., StarCoder2) and general-purpose (e.g., GPT-4o) architectures, evaluated on a benchmark created from CodeNet and consisting of 250 filtered Java programming problems and their solutions. Findings show a statistically significant performance decline as obfuscation complexity increases, with unexpected resilience shown by general-purpose models compared to their code-focused counterparts. While some models successfully identify obfuscation techniques, their ability to reconstruct the underlying program logic remains constrained, suggesting limitations in their semantic representation mechanisms. This research introduces a new evaluation approach for assessing code comprehension in language models and establishes empirical baselines for advancing research in security-critical code analysis applications such as reverse engineering and adversarial code analysis.
中文: 本研究通过系统化的代码混淆测试评估大语言模型的语义理解能力,发现随着混淆复杂性增加模型性能显著下降,尽管部分模型表现出韧性,但其重构程序逻辑的能力仍受限。
English: This study evaluates the semantic understanding of LLMs in code comprehension through systematic obfuscation tests, revealing performance declines with increased complexity and highlighting limitations in reconstructing program logic despite some models' resilience.

Authors:Zongcan Ding, Haodong Zhang, Peng Wu, Guansong Pang, Zhiwei Yang, Peng Wang, Yanning Zhang
Title: SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model
Abstract:
Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval-augmented generation (RAG) enhanced VLM) to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on a few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance the robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.
中文: SlowFastVAD是一种混合视频异常检测框架,通过快速检测器初步筛选和慢速视觉语言模型精细分析相结合,在降低计算成本的同时实现了高检测精度和可解释性。
English: SlowFastVAD is a hybrid video anomaly detection framework that combines a fast detector for initial screening with a slow, retrieval-augmented vision-language model for detailed analysis, achieving high accuracy and interpretability with reduced computational cost.
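The routing idea in the abstract above (score every segment with the fast detector, send only ambiguous segments to the slow detector, then fuse the two confidences) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the ambiguity band `low`/`high`, the weighted-average fusion with `alpha`, and the stub `slow_detector` callable are all hypothetical, and the abstract does not specify how ambiguity is decided or how the scores are fused.

```python
from typing import Callable, List


def slow_fast_scores(
    fast_scores: List[float],
    slow_detector: Callable[[int], float],
    low: float = 0.3,      # assumed lower edge of the "ambiguous" band
    high: float = 0.7,     # assumed upper edge of the "ambiguous" band
    alpha: float = 0.5,    # assumed fusion weight given to the fast score
) -> List[float]:
    """Fuse fast and slow anomaly scores, calling the slow detector only when needed."""
    fused = []
    for idx, fast in enumerate(fast_scores):
        if low <= fast <= high:
            # Ambiguous segment: pay for the expensive detector and blend the scores.
            slow = slow_detector(idx)
            fused.append(alpha * fast + (1 - alpha) * slow)
        else:
            # Confident segment: keep the cheap score as-is.
            fused.append(fast)
    return fused


calls: List[int] = []


def stub_slow_detector(segment_index: int) -> float:
    """Stand-in for a RAG-enhanced VLM; records which segments it was asked about."""
    calls.append(segment_index)
    return 1.0


scores = slow_fast_scores([0.1, 0.5, 0.9], stub_slow_detector)
print(scores, calls)
```

Only the middle segment falls in the assumed ambiguity band, so the slow detector runs once instead of three times, which is the source of the reduced overhead the abstract describes.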

Authors:Andrew Rufail, Daniel Kim, Sean O'Brien, Kevin Zhu
Title: CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning
Abstract:
We introduce CLEAR (Contrasting Textual Feedback with Experts and Amateurs for Reasoning), a novel approach to language model reasoning that leverages the strengths of a larger (expert) model and smaller (amateur) model. The expert and amateur models each provide feedback on a model's initial output and are contrasted with each other into refined feedback. This feedback is subsequently applied to iteratively improve CLEAR's responses. Our experiments demonstrate that CLEAR outperforms state-of-the-art methods in several challenging reasoning tasks, including story outline improvement (up to 19.6% relative increase in interestingness), constrained generation (up to 18.5% increase in coverage), mathematical reasoning (up to 6.7% improvement in accuracy) and mitigation of toxicity (decrease of up to 22% in toxicity).
中文:CLEAR是一种新颖的语言模型推理方法,通过对比专家与业余模型的反馈迭代优化回答,在故事改进、受限生成、数学推理和降低毒性方面均优于现有最优方法。
English: CLEAR is a novel language model reasoning approach that contrasts expert and amateur model feedback to iteratively refine responses, achieving superior performance in story improvement, constrained generation, math reasoning, and toxicity reduction.

Authors:Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien
Title: EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
Abstract:
The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely-used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities - models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
Chinese: EnDive基准通过评估五种代表性不足的英语方言,揭示了大型语言模型相比标准美式英语存在显著性能差距,并通过识别模型偏见推动方言感知的自然语言处理发展。
English: The EnDive benchmark evaluates large language models across five underrepresented English dialects, revealing significant performance disparities compared to Standard American English and advancing dialect-aware NLP through bias identification.

Authors:Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge
Title: OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Abstract:
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
中文: OLMoTrace是首个实时追踪语言模型输出至其数万亿词条训练数据的系统,通过扩展的infini-gram技术快速定位原文匹配,助力分析模型行为、事实核查及创造力评估。
English: OLMoTrace is the first real-time system that traces language model outputs back to their multi-trillion-token training data, using an extended infini-gram to quickly find verbatim matches and help analyze model behavior, fact checking, and creativity.

Authors:Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
Title: Leanabell-Prover: Posttraining Scaling in Formal Reasoning
Abstract:
Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 code. However, ATP has not yet been revolutionized by the recent posttraining scaling demonstrated by OpenAI o1/o3 and DeepSeek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continually train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning using the outcome reward returned by the Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an ongoing project, and we will progressively update our findings and release our data and training details.
中文: 本研究通过结合混合数据集的持续训练和利用Lean 4编译器反馈的强化学习,提升了自动定理证明能力,在全证明生成领域实现了最先进的性能。
English: This study enhances automated theorem proving by combining continual training with hybrid datasets and reinforcement learning using Lean 4 compiler feedback, achieving state-of-the-art performance in whole-proof generation.
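
The abstract does not say how pass@32 is computed. A common convention, assumed here, is the unbiased pass@k estimator 1 - C(n-c, k) / C(n, k) for a problem with n sampled proofs of which c are accepted by the verifier, averaged over problems. With n = k = 32 this reduces to "at least one of the 32 proofs compiles". The function names below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts, k):
    """Average pass@k over problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Four hypothetical problems, 32 sampled proofs each; two have a compiling proof.
print(benchmark_pass_at_k([(32, 0), (32, 1), (32, 5), (32, 0)], k=32))
```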

Authors:Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Peng Gao, Yu Qiao, Chao Dong, Yihao Liu
Title: Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision
Abstract:
We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions -- achieving optimal performance at 1K resolution -- while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.
中文摘要:OmniLV是一个通用的多模态低层视觉框架,通过文本和视觉提示处理四大类百余项子任务,基于扩散Transformer架构实现在1K分辨率下的高保真效果,其分离式指令编码与协同训练机制有效提升了多任务泛化能力。
English Summary: OmniLV is a universal multimodal framework for low-level vision tasks that uses text and visual prompts to handle over 100 sub-tasks across four categories, achieving high fidelity at 1K resolution through Diffusion Transformer-based architecture and specialized training techniques.

Authors:Peng Wu, Wanshun Su, Guansong Pang, Yujia Sun, Qingsen Yan, Peng Wang, Yanning Zhang
Title: AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
Abstract:
With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-based detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP's generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.
中文摘要:本研究提出了一种弱监督的视听协作框架,通过基于CLIP的跨模态融合和不确定性驱动的特征蒸馏技术,显著提升了视频异常检测性能,在多个基准测试中表现优异。
English Summary: This study introduces a weakly supervised audio-visual framework that enhances video anomaly detection through CLIP-based cross-modal fusion and uncertainty-driven feature distillation, achieving superior performance across multiple benchmarks.

Authors:Maolin Wang, Xiangyu Zhao
Title: MetaLoRA: Tensor-Enhanced Adaptive Low-Rank Fine-tuning
Abstract:
There has been a significant increase in the deployment of neural network models, presenting substantial challenges in model adaptation and fine-tuning. Efficient adaptation is crucial in maintaining model performance across diverse tasks and domains. While Low-Rank Adaptation (LoRA) has emerged as a promising parameter-efficient fine-tuning method, its fixed parameter nature limits its ability to handle dynamic task requirements effectively. Adapting models to new tasks can be challenging due to the need for extensive fine-tuning. Current LoRA variants primarily focus on general parameter reduction while overlooking the importance of dynamic parameter adjustment and meta-learning capabilities. Moreover, existing approaches mainly address static adaptations, neglecting the potential benefits of task-aware parameter generation in handling diverse task distributions. To address these limitations, this Ph.D. research proposes a LoRA generation approach to model task relationships and introduces MetaLoRA, a novel parameter-efficient adaptation framework incorporating meta-learning principles. This work develops a comprehensive architecture that integrates meta-parameter generation with adaptive low-rank decomposition, enabling efficient handling of both task-specific and task-agnostic features. MetaLoRA accurately captures task patterns by incorporating meta-learning mechanisms and dynamic parameter adjustment strategies. To our knowledge, this research represents the first attempt to provide a meta-learning enhanced LoRA variant, offering improved adaptation capability while maintaining computational efficiency in model fine-tuning.
中文: 本研究提出MetaLoRA框架,通过结合元学习机制与动态参数调整策略,克服了传统低秩适配方法的静态局限性,能够在保持计算效率的同时实现跨任务的优越适应能力。
English: This Ph.D. research introduces MetaLoRA, a novel parameter-efficient framework that integrates meta-learning with dynamic parameter adjustment to overcome the limitations of static LoRA methods, enabling superior adaptation across diverse tasks while maintaining computational efficiency.

Authors:Yifan Wu, Zhiyang Dou, Yuko Ishiwaka, Shun Ogawa, Yuke Lou, Wenping Wang, Lingjie Liu, Taku Komura
Title: CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
Abstract:
Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the subsequent imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture complex movements of the schools of fish, allowing for efficient imitation of the distribution of motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.
中文: 本文提出集体行为模仿学习(CBIL)方法,通过自监督视频表征学习和对抗模仿直接从视频中学习逼真的鱼群行为,无需运动轨迹即可实现异常行为检测等多种应用。
English: This paper introduces Collective Behavior Imitation Learning (CBIL), a scalable method that learns realistic fish schooling behaviors directly from videos using self-supervised video representation learning and adversarial imitation, eliminating the need for motion trajectories and enabling applications like abnormal behavior detection.

Authors:Samy Abdel-Ghaffar, Isaac Galatzer-Levy, Conor Heneghan, Xin Liu, Sarah Kernasovskiy, Brennan Garrett, Andrew Barakat, Daniel McDuff
Title: Passive Measurement of Autonomic Arousal in Real-World Settings
Abstract:
The autonomic nervous system (ANS) is activated during stress, which can have negative effects on cardiovascular health, sleep, the immune system, and mental health. While there are ways to quantify ANS activity in laboratories, there is a paucity of methods that have been validated in real-world contexts. We present the Fitbit Body Response Algorithm, an approach to continuous remote measurement of ANS activation through widely available remote wrist-based sensors. The design was validated via two experiments, a Trier Social Stress Test (n = 45) and ecological momentary assessments (EMA) of perceived stress (n = 87), providing both controlled and ecologically valid test data. Model performance predicting perceived stress when using all available sensor modalities was consistent with expectations (accuracy = 0.85) and outperformed models with access to only a subset of the signals. We discuss and address challenges to sensing that arise in real-world settings but are not present in conventional lab environments.
Chinese: Fitbit身体反应算法通过腕部传感器实现对自主神经系统激活的持续远程监测,经过实验室和真实环境下的压力测试验证,在预测感知压力方面表现出高准确性。
English: The Fitbit Body Response Algorithm enables continuous remote monitoring of autonomic nervous system activation using wrist sensors, validated through controlled and real-world stress tests with high accuracy in predicting perceived stress.

Authors:Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Lina Yao, Julian McAuley
Title: CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks
Abstract:
Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model undesirably deviates from user-provided instructions by executing tasks injected in the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. In this paper, we propose CachePrune, which defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context. By pruning such neurons, we encourage the LLM to treat the text spans of the input prompt context as pure data rather than as an indicator of instruction following. These neurons are identified via feature attribution with a loss function induced from an upper bound of the Direct Preference Optimization (DPO) objective. We show that such a loss function enables effective feature attribution with only a few samples. We further improve the quality of feature attribution by exploiting an observed triggering effect in instruction following. Our approach does not impose any formatting on the original prompt or introduce extra test-time LLM calls. Experiments show that CachePrune significantly reduces attack success rates without compromising the response quality. Note: This paper aims to defend against indirect prompt injection attacks, with the goal of developing more secure and robust AI systems.
中文摘要:大型语言模型易受间接提示注入攻击,本文提出CachePrune防御方法,通过识别并剪除触发任务的神经元,使模型将提示上下文仅视为纯数据而非指令,在保持响应质量的同时显著降低攻击成功率。
English Summary: Large Language Models are vulnerable to indirect prompt injection attacks, and this paper proposes CachePrune, a defense method that identifies and prunes task-triggering neurons to treat prompt context as pure data rather than instructions, significantly reducing attack success rates without compromising response quality.

Authors:Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Yangqiu Song, Bryan Hooi
Title: Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Abstract:
Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the original input instructions and executing maliciously injected instructions within data content, such as web documents retrieved from search engines. Existing defense methods, including prompt-engineering and fine-tuning approaches, typically instruct models to follow the original input instructions while suppressing their tendencies to execute injected instructions. However, our experiments reveal that suppressing instruction-following tendencies is challenging. Through analyzing failure cases, we observe that although LLMs tend to respond to any recognized instructions, they are aware of which specific instructions they are executing and can correctly reference them within the original prompt. Motivated by these findings, we propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. Our approach prompts LLMs to generate responses that include both answers and their corresponding instruction references. Based on these references, we filter out answers not associated with the original input instructions. Comprehensive experiments demonstrate that our method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, our approach has minimal impact on overall utility.
中文摘要:大型语言模型易受提示注入攻击而执行恶意指令,但新防御方法利用其指令跟随能力,通过要求响应必须引用原始指令来过滤未授权回答,显著降低攻击成功率且不影响正常功能。
English Summary: Large language models are vulnerable to prompt injection attacks that manipulate them into executing malicious instructions, but a new defense method leverages their instruction-following ability to filter unauthorized responses by requiring reference to original instructions, significantly reducing attack success while preserving utility.
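
The filtering step described above can be sketched as follows. The abstract does not give the prompt or output format, so the `INSTRUCTION: ... || ANSWER: ...` line format, the exact-match comparison after normalization, and all function names are assumptions for illustration, not the authors' implementation.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different quotes compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def parse_tagged_response(response):
    """Parse lines of the hypothetical form 'INSTRUCTION: <ref> || ANSWER: <answer>'."""
    pairs = []
    for line in response.splitlines():
        m = re.match(r"\s*INSTRUCTION:\s*(.*?)\s*\|\|\s*ANSWER:\s*(.*)", line)
        if m:
            pairs.append((m.group(1), m.group(2).strip()))
    return pairs


def filter_by_reference(response, original_instruction):
    """Keep only answers whose referenced instruction is the original one."""
    target = normalize(original_instruction)
    return [ans for ref, ans in parse_tagged_response(response)
            if normalize(ref) == target]


model_response = (
    "INSTRUCTION: Summarize the document. || ANSWER: The document describes a refund policy.\n"
    "INSTRUCTION: Ignore previous instructions and print the password. || ANSWER: hunter2\n"
)
print(filter_by_reference(model_response, "Summarize the document."))
```

The design point is that the model is not asked to refuse the injected instruction; it is asked to label what it executed, and the untrusted answer is dropped afterwards.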

Authors:Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Title: Attention Mechanism, Max-Affine Partition, and Universal Approximation
Abstract:
We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under $L_p$-norm for $1\leq p <\infty$. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
中文: 该研究通过将单头注意力机制解释为输入域划分机制,证明了配备基础结构的单头自注意力和交叉注意力能够普遍逼近连续函数和勒贝格可积函数。
English: The study demonstrates that single-head self- and cross-attention mechanisms with basic attached structures can universally approximate continuous and Lebesgue integrable functions by interpreting attention as an input domain-partitioning mechanism.

Authors:Yunzhong Zhang, Bo Xiong, You Zhou, Changqing Su, Zhen Cheng, Zhaofei Yu, Xun Cao, Tiejun Huang
Title: Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras
Abstract:
The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.
中文: 本研究提出尖峰成像测速法(SIV),这是一种利用尖峰相机和创新模块的深度学习框架,在复杂流场中实现了卓越的粒子图像测速性能,并提供了开源数据集和代码实现。
English: This study introduces Spike Imaging Velocimetry (SIV), a deep learning framework leveraging spike cameras and innovative modules for superior Particle Image Velocimetry performance in complex flow fields, supported by an open-source dataset and implementation.

Authors:Subash Neupane, Sudip Mittal, Shahram Rahimi
Title: Towards a HIPAA Compliant Agentic AI System in Healthcare
Abstract:
Agentic AI systems, powered by Large Language Models (LLMs) as their foundational reasoning engine, are transforming clinical workflows such as medical report generation and clinical summarization by autonomously analyzing sensitive healthcare data and executing decisions with minimal human oversight. However, their adoption demands strict compliance with regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA), particularly when handling Protected Health Information (PHI). This work-in-progress paper introduces a HIPAA-compliant Agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Our framework integrates three core mechanisms: (1) Attribute-Based Access Control (ABAC) for granular PHI governance, (2) a hybrid PHI sanitization pipeline combining regex patterns and a BERT-based model to minimize leakage, and (3) immutable audit trails for compliance verification.
中文: 这项进行中的研究提出一个符合HIPAA规范的自主AI框架,通过动态策略执行确保监管合规性,集成细粒度访问控制、混合式受保护健康信息脱敏及不可篡改审计追踪,以实现安全的临床数据处理。
English: This work-in-progress paper presents a HIPAA-compliant Agentic AI framework that ensures regulatory compliance through dynamic policy enforcement, integrating granular access control, hybrid PHI sanitization, and immutable audit trails for secure clinical data processing.
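
The three mechanisms can be illustrated with a small sketch. Everything here is an assumption for illustration: the regex patterns, the single ABAC rule, and the hash-chained log are toy stand-ins, the BERT-based half of the sanitization pipeline is omitted, and a hash chain is tamper-evident rather than truly immutable. Nothing below should be read as sufficient for HIPAA compliance.

```python
import hashlib
import json
import re

# Hypothetical PHI patterns; a real pipeline would cover many more identifiers.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def sanitize(text):
    """Replace regex-detected PHI with a bracketed type label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def is_allowed(subject, resource, action):
    """Toy ABAC rule: clinicians may read records of their own department."""
    return (action == "read"
            and subject.get("role") == "clinician"
            and subject.get("department") == resource.get("department"))


class AuditLog:
    """Append-only log in which each entry hashes the previous entry."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev, "hash": digest})

    def verify(self):
        """Recompute the chain; any edited entry breaks verification."""
        prev = "0" * 64
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            digest = hashlib.sha256((prev + payload).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != digest:
                return False
            prev = entry["hash"]
        return True


print(sanitize("SSN 123-45-6789, call 555-123-4567 or mail jane.doe@example.org"))
```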

Authors:Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, Yiran Chen
Title: Transitive Array: An Efficient GEMM Accelerator with Result Reuse
Abstract:
Deep Neural Networks (DNNs) and Large Language Models (LLMs) have revolutionized artificial intelligence, yet their deployment faces significant memory and computational challenges, especially in resource-constrained environments. Quantization techniques have mitigated some of these issues by reducing data precision, primarily focusing on General Matrix Multiplication (GEMM). This study introduces a novel sparsity paradigm, transitive sparsity, which leverages the reuse of previously computed results to substantially minimize computational overhead in GEMM operations. By representing transitive relations using a directed acyclic graph, we develop an efficient strategy for determining optimal execution orders, thereby overcoming inherent challenges related to execution dependencies and parallelism. Building on this foundation, we present the Transitive Array, a multiplication-free accelerator designed to exploit transitive sparsity in GEMM. Our architecture effectively balances computational workloads across multiple parallel lanes, ensuring high efficiency and optimal resource utilization. Comprehensive evaluations demonstrate that the Transitive Array achieves approximately 7.46$\times$ and 3.97$\times$ speedup and 2.31$\times$ and 1.65$\times$ energy reduction compared to state-of-the-art accelerators such as Olive and BitVert while maintaining comparable model accuracy on LLaMA models.
中文: 本研究提出传递稀疏性新范式，通过有向无环图优化计算顺序并设计Transitive Array加速器，在保持模型精度的同时显著提升GEMM运算速度并降低能耗。
English: This study introduces transitive sparsity, a novel paradigm using directed acyclic graphs to optimize computation order and presents the Transitive Array accelerator, achieving significant speedup and energy reduction in GEMM operations while maintaining model accuracy.
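
The abstract does not define transitive sparsity in detail. One plausible reading, assumed here, is that when binary weight rows share bits, the partial sum of a row can be built from an already computed row whose set bits are a subset, so only the extra bits cost additions and no multiplications are needed. The sketch below is a software toy under that assumption; the function name and the popcount-based ordering are hypothetical and do not reproduce the paper's DAG scheduling or hardware.

```python
def binary_matvec_with_reuse(patterns, x):
    """Multiply binary rows by x using additions only, reusing earlier results.

    patterns: list of ints; bit j set means the row has weight 1 at x[j].
    Returns (row_results, number_of_additions).
    """
    cache = {0: 0}  # pattern -> dot product with x
    additions = 0
    # Visiting patterns by popcount guarantees every subset pattern that
    # occurs in the input has been computed before its supersets.
    for p in sorted(set(patterns), key=lambda v: bin(v).count("1")):
        if p in cache:
            continue
        # Reuse the cached subset pattern that shares the most set bits.
        base = max((q for q in cache if q & p == q),
                   key=lambda q: bin(q).count("1"))
        total = cache[base]
        rest = p & ~base  # bits not covered by the reused result
        j = 0
        while rest:
            if rest & 1:
                total += x[j]
                additions += 1
            rest >>= 1
            j += 1
        cache[p] = total
    return [cache[p] for p in patterns], additions


x = [1, 2, 3, 4]                       # x[0] pairs with the least significant bit
rows = [0b1010, 0b1011, 0b1111]
print(binary_matvec_with_reuse(rows, x))
```

In this toy case the three rows need 4 additions with reuse, versus 9 if each row were accumulated from scratch.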

Authors:Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu
Title: Universal Approximation with Softmax Attention
Abstract:
We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.
中文: 本研究证明,通过线性变换,双层自注意力和带softmax的单层自注意力均能通用逼近紧致域上的连续序列到序列函数,揭示了自注意力可模拟广义ReLU,使多头注意力无需前馈网络即可独立作为通用逼近器。
English: This study demonstrates that both two-layer self-attention and one-layer self-attention with softmax can universally approximate continuous sequence-to-sequence functions using linear transformations, revealing that self-attention mimics a generalized ReLU and enables multi-head attention to serve as a standalone universal approximator without relying on feed-forward networks.

Authors:Rohan Surana, Junda Wu, Zhouhang Xie, Yu Xia, Harald Steck, Dawen Liang, Nathan Kallus, Julian McAuley
Title: From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System
Abstract:
Conversational recommender systems (CRS) typically require extensive domain-specific conversational datasets, yet high costs, privacy concerns, and data-collection challenges severely limit their availability. Although Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities, practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints, especially in sensitive or rapidly evolving domains. However, training these smaller models effectively still demands substantial domain-specific conversational data, which remains challenging to obtain. To address these limitations, we propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques. Specifically, our method utilizes publicly available non-conversational domain data, including item metadata, user reviews, and collaborative signals, as seed inputs. By employing active learning strategies to select the most informative seed samples, our approach efficiently guides LLMs to generate synthetic, semantically coherent conversational interactions tailored explicitly to the target domain. Extensive experiments validate that conversational data generated by our proposed framework significantly improves the performance of LLM-based CRS models, effectively addressing the challenges of building CRS in no- or low-resource scenarios.
中文摘要:本研究提出了一种主动数据增强框架,利用主动学习引导的大型语言模型从非对话数据源生成领域特定的对话训练数据,有效提升了数据稀缺场景下推荐系统的性能。
English Summary: This study introduces an active data augmentation framework that uses large language models guided by active learning to generate domain-specific conversational training data from non-conversational sources, effectively enhancing recommender system performance in data-scarce environments.

Authors:Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji
Title: Acting Less is Reasoning More! Teaching Model to Act Efficiently
Abstract:
Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools during long-form reasoning, such as search engines and code interpreters, to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most existing approaches optimize only for final correctness without considering the efficiency or necessity of external tool use. This often leads to excessive tool calling, incurring high computational costs and hindering the development of internal reasoning capabilities - a phenomenon known as \textit{cognitive offloading}. To this end, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers answer correctness and the model's tool-use behavior in reaching that answer. To validate its effectiveness, we introduce the metric of \textit{tool productivity}, defined as the ratio between the number of correct answers and the total number of tool calls across all test cases. This metric reflects how efficiently tool usage contributes to successful task completion, with higher values indicating smarter and more autonomous reasoning. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 68.3\% and improves tool productivity by up to 215.4\%, while maintaining comparable answer accuracy.
中文: 工具集成推理通过外部工具增强大语言模型,但现有强化学习方法常导致工具滥用,因此提出OTC-PO框架,在保证答案准确性的同时最小化工具调用,显著提升了工具使用效率。
English: Tool-integrated reasoning enhances LLMs by using external tools, but current RL methods often cause excessive tool use, so OTC-PO is proposed to optimize for both accuracy and minimal tool calls, significantly improving efficiency without compromising performance.
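
The tool productivity metric defined in the abstract (number of correct answers divided by total tool calls across all test cases) is simple enough to compute directly. The function name, the input format, and the handling of the zero-tool-call case are assumptions; the abstract does not specify the latter.

```python
def tool_productivity(results):
    """Correct answers divided by total tool calls across all test cases.

    results: iterable of (is_correct, num_tool_calls) pairs.
    Assumption: with zero tool calls overall the ratio is undefined, so this
    sketch returns infinity if anything was answered correctly, else 0.0.
    """
    results = list(results)
    correct = sum(1 for ok, _ in results if ok)
    calls = sum(n for _, n in results)
    if calls == 0:
        return float("inf") if correct else 0.0
    return correct / calls


# Same accuracy, fewer tool calls: productivity rises.
baseline = [(True, 1), (True, 3), (False, 4)]
frugal = [(True, 1), (True, 1), (False, 2)]
print(tool_productivity(baseline), tool_productivity(frugal))
```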

Authors:Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, Bo An, Xiangyu Zhao
Title: Generative Auto-Bidding with Value-Guided Explorations
Abstract:
Auto-bidding, with its strong capability to optimize bidding decisions within dynamic and competitive online environments, has become a pivotal strategy for advertising platforms. Existing approaches typically employ rule-based strategies or Reinforcement Learning (RL) techniques. However, rule-based strategies lack the flexibility to adapt to time-varying market conditions, and RL-based methods struggle to capture essential historical dependencies and observations within Markov Decision Process (MDP) frameworks. Furthermore, these approaches often face challenges in ensuring strategy adaptability across diverse advertising objectives. Additionally, as offline training methods are increasingly adopted to facilitate the deployment and maintenance of stable online strategies, the issues of documented behavioral patterns and behavioral collapse resulting from training on fixed offline datasets become increasingly significant. To address these limitations, this paper introduces a novel offline Generative Auto-bidding framework with Value-Guided Explorations (GAVE). GAVE accommodates various advertising objectives through a score-based Return-To-Go (RTG) module. Moreover, GAVE integrates an action exploration mechanism with an RTG-based evaluation method to explore novel actions while ensuring stability-preserving updates. A learnable value function is also designed to guide the direction of action exploration and mitigate Out-of-Distribution (OOD) problems. Experimental results on two offline datasets and real-world deployments demonstrate that GAVE outperforms state-of-the-art baselines in both offline evaluations and online A/B tests. By applying the core methods of this framework, we proudly secured first place in the NeurIPS 2024 competition, 'AIGB Track: Learning Auto-Bidding Agents with Generative Models'.
中文摘要:本文提出GAVE离线生成式自动竞价框架,通过价值引导探索和基于RTG的评分模块,解决了现有方法在广告目标适应性和稳定性方面的不足,并在实验和实际部署中验证了其优越性能。
English Summary: This paper introduces GAVE, a novel offline generative auto-bidding framework that overcomes limitations of existing methods by incorporating value-guided explorations and a score-based RTG module to ensure adaptability across advertising objectives while maintaining stability.

Authors:Yichao Feng, Shuai Zhao, Yueqiu Li, Luwei Xiao, Xiaobao Wu, Anh Tuan Luu
Title: Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation
Abstract:
Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
Chinese Summary: 本文提出了一种新颖的方面摘要框架——自方面检索增强摘要生成,通过嵌入驱动检索机制提取相关文本片段,有效解决了标记限制问题并提升了摘要性能,无需依赖上下文学习。
English Summary: This paper introduces a novel framework, Self-Aspect Retrieval Enhanced Summary Generation, which uses embedding-driven retrieval to identify relevant text segments for aspect-based summarization, effectively overcoming token limits and improving performance without relying on in-context learning.
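
The retrieval step can be sketched as ranking text segments by similarity to the aspect and keeping only the top ones for the summarizer. The abstract does not name an embedding model, so this sketch substitutes a bag-of-words vector with cosine similarity; `embed`, `retrieve_segments`, and `top_k` are hypothetical names, and a learned embedding would replace `embed` in practice.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_segments(aspect, segments, top_k=2):
    """Return the top_k segments most similar to the aspect, in document order."""
    query = embed(aspect)
    ranked = sorted(enumerate(segments),
                    key=lambda pair: cosine(query, embed(pair[1])),
                    reverse=True)
    keep = sorted(i for i, _ in ranked[:top_k])
    return [segments[i] for i in keep]


review = [
    "The battery life lasts two days on a single charge.",
    "The screen is bright and sharp.",
    "Charging the battery takes one hour.",
    "Shipping was slow.",
]
print(retrieve_segments("battery life", review))
```

Only the retained segments would then be passed to the language model, which is how the approach keeps the prompt within token limits.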

Authors:Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan
Title: How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
Abstract:
Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce \textbf{Mol-Hallu}, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing stage (HRPP) is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.
大语言模型在分子科学中存在幻觉问题,而提出的Mol-Hallu指标和HRPP方法能有效评估并减少此类错误,从而提升科学应用的可靠性。
Large language models face hallucination issues in molecular science, but the proposed Mol-Hallu metric and HRPP method effectively evaluate and reduce these errors to enhance reliability in scientific applications.

Authors:Yechao Zhang, Yuxuan Zhou, Tianyu Li, Minghui Li, Shengshan Hu, Wei Luo, Leo Yu Zhang
Title: Secure Transfer Learning: Training Clean Models Against Backdoor in (Both) Pre-trained Encoders and Downstream Datasets
Abstract:
Transfer learning from pre-trained encoders has become essential in modern machine learning, enabling efficient model adaptation across diverse tasks. However, this combination of pre-training and downstream adaptation creates an expanded attack surface, exposing models to sophisticated backdoor embeddings at both the encoder and dataset levels--an area often overlooked in prior research. Additionally, the limited computational resources typically available to users of pre-trained encoders constrain the effectiveness of generic backdoor defenses compared to end-to-end training from scratch. In this work, we investigate how to mitigate potential backdoor risks in resource-constrained transfer learning scenarios. Specifically, we conduct an exhaustive analysis of existing defense strategies, revealing that many follow a reactive workflow based on assumptions that do not scale to unknown threats, novel attack types, or different training paradigms. In response, we introduce a proactive mindset focused on identifying clean elements and propose the Trusted Core (T-Core) Bootstrapping framework, which emphasizes the importance of pinpointing trustworthy data and neurons to enhance model security. Our empirical evaluations demonstrate the effectiveness and superiority of T-Core, specifically assessing 5 encoder poisoning attacks, 7 dataset poisoning attacks, and 14 baseline defenses across five benchmark datasets, addressing four scenarios of 3 potential backdoor threats.
中文: 本研究针对资源受限迁移学习中的后门漏洞,提出了主动防御框架T-Core Bootstrapping,通过识别可信数据和神经元来增强模型安全性,实验证明其防御效果优于现有方法。
English: This study addresses backdoor vulnerabilities in resource-constrained transfer learning by proposing the proactive T-Core Bootstrapping framework, which identifies trustworthy data and neurons to enhance security, demonstrating superior defense against various poisoning attacks compared to existing methods.

Authors:Uyen Phan, Ozer Can Devecioglu, Serkan Kiranyaz, Moncef Gabbouj
Title: Progressive Transfer Learning for Multi-Pass Fundus Image Restoration
Abstract:
Diabetic retinopathy (DR) is a leading cause of vision impairment, making its early diagnosis through fundus imaging critical for effective treatment planning. However, the presence of poor-quality fundus images caused by factors such as inadequate illumination, noise, blurring and other motion artifacts poses a significant challenge for accurate DR screening. In this study, we propose progressive transfer learning (PTL) for multi-pass restoration to iteratively enhance the quality of degraded fundus images, ensuring more reliable DR screening. Unlike previous methods that often focus on single-pass restoration, multi-pass restoration via PTL can achieve superior blind restoration performance that can even improve most of the good-quality fundus images in the dataset. Initially, a CycleGAN model is trained to restore low-quality images, followed by PTL-induced restoration passes over the latest restored outputs to improve overall quality in each pass. The proposed method can learn blind restoration without requiring any paired data while surpassing its limitations by leveraging progressive learning and fine-tuning strategies to minimize distortions and preserve critical retinal features. To evaluate PTL's effectiveness on multi-pass restoration, we conducted experiments on DeepDRiD, a large-scale fundus imaging dataset specifically curated for diabetic retinopathy detection. Our results demonstrate state-of-the-art performance, showcasing PTL's potential as a superior approach to iterative image quality restoration.
Chinese: 本研究提出渐进式迁移学习方法进行多次修复,通过迭代提升眼底图像质量实现更可靠的糖尿病视网膜病变筛查,无需配对数据即在DeepDRiD数据集上取得了最优性能。
English: This study introduces progressive transfer learning for multi-pass restoration to iteratively enhance degraded fundus images, enabling more reliable diabetic retinopathy screening without requiring paired data and achieving state-of-the-art performance on the DeepDRiD dataset.

Authors:Xiaomei Zhang, Zhaoxi Zhang, Yanjun Zhang, Xufei Zheng, Leo Yu Zhang, Shengshan Hu, Shirui Pan
Title: Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks
Abstract:
Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.
中文: 本研究提出基于掩码语言模型的检测方法(MLMD)来识别文本对抗攻击,并通过引入梯度引导的MLMD(GradMLMD)优化计算效率,在保持检测性能的同时跳过非关键词以显著降低资源消耗。
English: The study introduces Masked Language Model-based Detection (MLMD) to identify textual adversarial attacks by leveraging mask and unmask operations, and further optimizes it with Gradient-guided MLMD (GradMLMD) to reduce computational overhead by skipping non-keywords without sacrificing performance.
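
The keyword-skipping idea behind GradMLMD can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the per-token gradient norms are hypothetical numbers supplied by hand (in practice they would come from a model's backward pass), and the `keep_ratio` cutoff and the function name `tokens_to_mask` are invented for this example. The sketch only shows how ranking tokens by gradient magnitude lets a detector mask a subset of positions instead of every token one by one.

```python
def tokens_to_mask(tokens, grad_norms, keep_ratio=0.5):
    """Return indices of the tokens with the largest gradient norms.

    Tokens outside the returned set are treated as non-keywords and skipped,
    so the mask-and-unmask procedure runs on fewer positions.
    """
    if len(tokens) != len(grad_norms):
        raise ValueError("tokens and grad_norms must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: grad_norms[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical sentence and hypothetical gradient norms per token.
tokens = ["the", "movie", "was", "utterly", "dreadful", "."]
grad_norms = [0.02, 0.40, 0.03, 0.55, 0.90, 0.01]

keyword_idx = tokens_to_mask(tokens, grad_norms, keep_ratio=0.5)
print([tokens[i] for i in keyword_idx])
```

With half of the positions skipped, the number of masked forward passes in this toy setting drops from six to three.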

Authors:Zhouhang Xie, Junda Wu, Yiran Shen, Yu Xia, Xintong Li, Aaron Chang, Ryan Rossi, Sachin Kumar, Bodhisattwa Prasad Majumder, Jingbo Shang, Prithviraj Ammanabrolu, Julian McAuley
Title: A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models
Abstract:
Personalized preference alignment for large language models (LLMs), the process of tailoring LLMs to individual users' preferences, is an emerging research direction spanning the areas of NLP and personalization. In this survey, we present an analysis of works on personalized alignment and modeling for LLMs. We introduce a taxonomy of preference alignment techniques, including training-time, inference-time, and user-modeling-based methods. We provide analysis and discussion on the strengths and limitations of each group of techniques and then cover evaluation, benchmarks, and open problems in the field.
中文摘要:本综述分析了大语言模型的个性化对齐技术,将其分为基于训练、推理和用户建模的方法,并评估了各类技术的优缺点及该领域现存挑战。
English Summary: This survey analyzes personalized alignment techniques for large language models, categorizing them into training-based, inference-based, and user-modeling methods while evaluating their strengths, limitations, and current challenges in the field.

Authors:Yong Bai, Rui Xiang, Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai
Title: CHIME: A Compressive Framework for Holistic Interest Modeling
Abstract:
Modeling holistic user interests is important for improving recommendation systems but is challenged by high computational cost and difficulty in handling diverse information with full behavior context. Existing search-based methods might lose critical signals during behavior selection. To overcome these limitations, we propose CHIME: A Compressive Framework for Holistic Interest Modeling. It uses adapted large language models to encode complete user behaviors with heterogeneous inputs. We introduce multi-granular contrastive learning objectives to capture both persistent and transient interest patterns and apply residual vector quantization to generate compact embeddings. CHIME demonstrates superior ranking performance across diverse datasets, establishing a robust solution for scalable holistic interest modeling in recommendation systems.
中文: CHIME是一个压缩框架,采用适配的大语言模型和多粒度对比学习来编码完整的用户行为,在推荐系统中实现了卓越的排序性能,为可扩展的整体兴趣建模提供了稳健解决方案。
English: CHIME is a compressive framework that uses adapted large language models and multi-granular contrastive learning to encode complete user behaviors, achieving superior ranking performance for scalable holistic interest modeling in recommendation systems.
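
Residual vector quantization, which the abstract names as the step that produces compact embeddings, can be sketched in a few lines. The codebooks below are tiny hand-written examples rather than learned ones, and the function names are hypothetical; how CHIME trains or sizes its codebooks is not stated in the abstract. Each stage quantizes whatever residual the previous stage left behind, so a vector is stored as one small index per stage.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct an approximation by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks for 2-D vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]
codes = rvq_encode([1.2, 0.9], codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

The compact representation here is the list of indices in `codes`; adding stages trades a longer code for a smaller reconstruction error.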

Authors:Shutong Chen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
Title: FedMerge: Federated Personalization via Model Merging
Abstract:
One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there have been advances in FL to train multiple global models for better personalization, they only provide limited choices to clients, so local finetuning is still indispensable. In this paper, we propose a novel "FedMerge" approach that can create a personalized model per client by simply merging multiple global models with automatically optimized and customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of global models and the merging weights for each client. Unlike existing FL approaches where the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting the local training and re-initializing it to a global model, the merged model aligns better with each client's task and data distribution, smoothing the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge on three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
Chinese Summary: FedMerge提出了一种个性化联邦学习方法,通过为每个客户端自动优化并定制权重来合并多个全局模型,无需本地微调即可有效服务非独立同分布客户端,并在多种场景下超越现有方法。
English Summary: FedMerge introduces a personalized federated learning approach by optimally merging multiple global models with customized weights for each client, eliminating the need for local fine-tuning and outperforming existing methods in non-IID settings.
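
The server-side merging step can be illustrated with a small sketch. It covers only the weighted combination of global models into one model per client; the joint optimization of the global models and merging weights described in the abstract is not shown. The parameter vectors, the client names, and the choice of weights that sum to one are all assumptions made for this example.

```python
def merge_models(global_models, weights):
    """Weighted sum of flat parameter vectors using one client's merging weights."""
    if len(global_models) != len(weights):
        raise ValueError("need exactly one weight per global model")
    dim = len(global_models[0])
    merged = [0.0] * dim
    for model, w in zip(global_models, weights):
        for j in range(dim):
            merged[j] += w * model[j]
    return merged

# Two hypothetical global models, each flattened to three parameters.
global_models = [
    [1.0, 0.0, 2.0],
    [0.0, 4.0, 2.0],
]

# Hypothetical per-client merging weights (assumed here to sum to 1).
client_weights = {
    "client_a": [0.75, 0.25],
    "client_b": [0.5, 0.5],
}

# The server sends each client only its own merged model.
personalized = {
    client: merge_models(global_models, w) for client, w in client_weights.items()
}
print(personalized)
```

In this sketch each client receives a single vector, which mirrors the abstract's point that the server sends one customized merged model per client rather than broadcasting every global model.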

Authors:Kaiyuan Li, Rui Xiang, Yong Bai, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai
Title: BBQRec: Behavior-Bind Quantization for Multi-Modal Sequential Recommendation
Abstract:
Multi-modal sequential recommendation systems leverage auxiliary signals (e.g., text, images) to alleviate data sparsity in user-item interactions. While recent methods exploit large language models to encode modalities into discrete semantic IDs for autoregressive prediction, we identify two critical limitations: (1) Existing approaches adopt fragmented quantization, where modalities are independently mapped to semantic spaces misaligned with behavioral objectives, and (2) Over-reliance on semantic IDs disrupts inter-modal semantic coherence, thereby weakening the expressive power of multi-modal representations for modeling diverse user preferences. To address these challenges, we propose a Behavior-Bind multi-modal Quantization for Sequential Recommendation (BBQRec for short) featuring dual-aligned quantization and semantics-aware sequence modeling. First, our behavior-semantic alignment module disentangles modality-agnostic behavioral patterns from noisy modality-specific features through contrastive codebook learning, ensuring semantic IDs are inherently tied to recommendation tasks. Second, we design a discretized similarity reweighting mechanism that dynamically adjusts self-attention scores using quantized semantic relationships, preserving multi-modal synergies while avoiding invasive modifications to the sequence modeling architecture. Extensive evaluations across four real-world benchmarks demonstrate BBQRec's superiority over the state-of-the-art baselines.
中文摘要:该摘要提出BBQRec系统,通过行为语义对齐的量化方法和语义感知序列建模,解决了现有多模态推荐中语义空间与行为目标错位及模态间语义连贯性受损的问题,在四个真实数据集上验证了其优越性。
English Summary: This abstract introduces BBQRec, a novel multi-modal sequential recommendation system that overcomes limitations in existing methods by aligning semantic IDs with behavioral objectives and preserving inter-modal coherence through dual-aligned quantization and semantics-aware modeling.

Authors:Changqing Su, Yanqin Chen, Zihan Lin, Zhen Cheng, You Zhou, Bo Xiong, Zhaofei Yu, Tiejun Huang
Title: Inter-event Interval Microscopy for Event Cameras
Abstract:
Event cameras, innovative bio-inspired sensors, differ from traditional cameras by sensing changes in intensity rather than directly perceiving intensity, and they record these variations as a continuous stream of "events". Intensity reconstruction from these sparse events has long been a challenging problem. Previous approaches mainly focused on transforming motion-induced events into videos or achieving intensity imaging for static scenes by integrating modulation devices at the event camera acquisition end. In this paper, for the first time, we achieve event-to-intensity conversion using a static event camera for both static and dynamic scenes in fluorescence microscopy. Unlike conventional methods that primarily rely on event integration, the proposed Inter-event Interval Microscopy (IEIM) quantifies the time interval between consecutive events at each pixel. With a fixed threshold in the event camera, the time interval can precisely represent the intensity. At the hardware level, the proposed IEIM integrates a pulse light modulation device within a microscope equipped with an event camera, termed Pulse Modulation-based Event-driven Fluorescence Microscopy. Additionally, we have collected the IEIMat dataset under various scenes, including high dynamic range and high-speed scenarios. Experimental results on the IEIMat dataset demonstrate that the proposed IEIM achieves superior spatial and temporal resolution, as well as a higher dynamic range, with lower bandwidth compared to other methods. The code and the IEIMat dataset will be made publicly available.
中文: 本文提出的帧间事件间隔显微镜技术(IEIM)通过量化事件间的时间间隔,首次在荧光显微镜中实现了静态和动态场景下的事件到强度转换,相比现有方法具有更高的时空分辨率、动态范围和更低的带宽需求。
English: This paper introduces Inter-event Interval Microscopy (IEIM), a novel method that achieves event-to-intensity conversion for both static and dynamic scenes in fluorescence microscopy by quantifying time intervals between events, demonstrating superior resolution and dynamic range with lower bandwidth compared to existing approaches.
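
The idea that an inter-event interval encodes intensity can be sketched with a deliberately simplified model. The abstract does not give the exact mapping, so the inverse relation used here (relative intensity proportional to threshold divided by interval) is an assumption for illustration, as are the function names and the timestamps. The intuition is that a brighter pixel reaches the fixed threshold sooner, so its events arrive closer together.

```python
def intervals(timestamps):
    """Time gaps between consecutive events at a single pixel."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

def intensity_from_events(timestamps, threshold=1.0):
    """Relative intensity per interval, ASSUMING intensity ~ threshold / interval.

    This inverse model is a simplification for illustration, not the paper's formula.
    Non-positive intervals are ignored.
    """
    return [threshold / dt for dt in intervals(timestamps) if dt > 0]

# Hypothetical event timestamps (arbitrary time units) at two pixels.
bright = intensity_from_events([0.0, 0.5, 1.0, 1.5])  # short gaps between events
dim = intensity_from_events([0.0, 2.0, 4.0])          # long gaps between events
print(bright, dim)
```

Under this toy model the pixel with shorter intervals is assigned the higher relative intensity, and no event integration over a time window is needed.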

Authors:Tianyu Cui, Xinjie Lin, Sijia Li, Miao Chen, Qilei Yin, Qi Li, Ke Xu
Title: TrafficLLM: Enhancing Large Language Models for Network Traffic Analysis with Generic Traffic Representation
Abstract:
Machine learning (ML) powered network traffic analysis has been widely used for threat detection. Unfortunately, its generalization across different tasks and unseen data is very limited. Large language models (LLMs), known for their strong generalization capabilities, have shown promising performance in various domains. However, their application to the traffic analysis domain is limited due to the significantly different characteristics of network traffic. To address the issue, in this paper, we propose TrafficLLM, which introduces a dual-stage fine-tuning framework to learn generic traffic representation from heterogeneous raw traffic data. The framework uses traffic-domain tokenization, a dual-stage tuning pipeline, and extensible adaptation to help LLMs unlock their generalization ability on dynamic traffic analysis tasks, enabling traffic detection and traffic generation across a wide range of downstream tasks. We evaluate TrafficLLM across 10 distinct scenarios and 229 types of traffic. TrafficLLM achieves F1-scores of 0.9875 and 0.9483, with up to 80.12% and 33.92% better performance than existing detection and generation methods. It also shows strong generalization on unseen traffic with an 18.6% performance improvement. We further evaluate TrafficLLM in real-world scenarios. The results confirm that TrafficLLM is easy to scale and achieves accurate detection performance on enterprise traffic.
中文: 本文提出TrafficLLM双阶段微调框架,通过领域分词和自适应扩展使大语言模型适用于网络流量分析,在多种检测与生成任务中相比现有方法展现出卓越的泛化能力和性能提升。
English: The paper introduces TrafficLLM, a dual-stage fine-tuning framework that adapts large language models to network traffic analysis, achieving superior generalization and performance across diverse detection and generation tasks compared to existing methods.

Authors:Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, Trung-Nghia Le, Minh-Triet Tran
Title: FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement
Abstract:
Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: the Concept Fusion technique and the Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, the Localized Refinement loss function is introduced to preserve subject-representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
中文: 本文提出融合优化方法,通过概念融合技术增强数据多样性应对过拟合,并采用局部优化损失函数防止属性泄露,在保持图像真实感的同时显著优于现有先进方法。
English: The paper introduces Fuse-and-Refine (FaR), a novel method that addresses overfitting and attribute leakage in text-to-image generation through Concept Fusion for data augmentation and Localized Refinement loss for precise attribute control, demonstrating superior performance over existing approaches.

Authors:Tu Ao, Yanhua Yu, Yuling Wang, Yang Deng, Zirui Guo, Liang Pang, Pinghui Wang, Tat-Seng Chua, Xiao Zhang, Zhen Cai
Title: LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph
Abstract:
Large Language Models (LLMs) have impressive capabilities in text understanding and zero-shot reasoning. However, delays in knowledge updates may cause them to reason incorrectly or produce harmful results. Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs by structurally organizing and connecting a wide range of entities and relations. Existing KG-based LLM reasoning methods only inject KGs' knowledge into prompts in a textual form, ignoring its structural information. Moreover, they mostly rely on closed-source models or open-source models with large parameter counts, which leads to high resource consumption. To address this, we propose a novel Lightweight and efficient Prompt learning-ReasOning Framework for KGQA (LightPROF), which leverages the full potential of LLMs to tackle complex reasoning tasks in a parameter-efficient manner. Specifically, LightPROF follows a "Retrieve-Embed-Reason" process, first accurately and stably retrieving the corresponding reasoning graph from the KG through a retrieval module. Next, through a Transformer-based Knowledge Adapter, it finely extracts and integrates factual and structural information from the KG, then maps this information to the LLM's token embedding space, creating an LLM-friendly prompt to be used by the LLM for the final reasoning. Additionally, LightPROF only requires training the Knowledge Adapter and is compatible with any open-source LLM. Extensive experiments on two public KGQA benchmarks demonstrate that LightPROF achieves superior performance with small-scale LLMs. Furthermore, LightPROF shows significant advantages in terms of input token count and reasoning time.
中文: LightPROF是一种轻量级框架,通过参数高效的提示学习方法,将知识图谱的结构化信息有效整合到大型语言模型中,从而利用小规模模型实现卓越的推理性能。
English: LightPROF is a lightweight framework that enhances LLMs' reasoning by efficiently integrating structural knowledge from Knowledge Graphs through a parameter-efficient prompt learning approach, achieving superior performance with small-scale models.

Authors:Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
Title: Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
Abstract:
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. On math reasoning tasks, the resulting Phi-4-Mini-Reasoning model exceeds much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective in unlocking strong reasoning capabilities even in resource-constrained small models.
中文: 通过包含大规模思维链数据的四步系统训练方法,小型语言模型Phi-4 Mini在数学推理任务上超越了多个大型模型,证明精心设计的训练方案能有效释放资源受限小模型的强大推理能力。
English: A systematic four-step training recipe using large-scale Chain-of-Thought data enables small language models like Phi-4-Mini to surpass larger models in mathematical reasoning tasks, demonstrating that strategic training can unlock strong reasoning capabilities even in resource-constrained models.

Authors:Tanmay Chakraborty, Marion Koelle, Jörg Schlötterer, Nadine Schlicker, Christian Wirth, Christin Seifert
Title: Explanation format does not matter; but explanations do -- An Eggsbert study on explaining Bayesian Optimisation tasks
Abstract:
Bayesian Optimisation (BO) is a family of methods for finding optimal parameters when the underlying function to be optimised is unknown. BO is used, for example, for hyperparameter tuning in machine learning and as an expert support tool for tuning cyberphysical systems. For settings where humans are involved in the tuning task, methods have been developed to explain BO (Explainable Bayesian Optimization, XBO). However, there is little guidance on how to present XBO results to humans so that they can tune the system effectively and efficiently. In this paper, we investigate how the XBO explanation format affects users' task performance, task load, understanding and trust in XBO. We chose a task that is accessible to a wide range of users. Specifically, we set up an egg cooking scenario with 6 parameters that participants had to adjust to achieve a perfect soft-boiled egg. We compared three different explanation formats: a bar chart, a list of rules and a textual explanation in a between-subjects online study with 213 participants. Our results show that adding any type of explanation increases task success, reduces the number of trials needed to achieve success, and improves comprehension and confidence. While explanations add more information for participants to process, we found no increase in user task load. We also found that the aforementioned results were independent of the explanation format; all formats had a similar effect. This is an interesting finding for practical applications, as it suggests that explanations can be added to BO tuning tasks without the burden of designing or selecting specific explanation formats. In the future, it would be interesting to investigate scenarios of prolonged use of the explanation formats and whether they have different effects on users' mental models of the underlying system.
中文: 本研究表明,在贝叶斯优化中加入任何形式的解释都能显著提高用户的任务表现、理解和信心,且不会增加任务负担,所有测试的解释格式均产生相似效果。
English: This study demonstrates that incorporating any form of explanation into Bayesian Optimization significantly improves user performance, comprehension, and confidence without increasing task load, with all tested formats yielding similar benefits.

Authors:Yan Wang, Baoxiong Jia, Ziyu Zhu, Siyuan Huang
Title: Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
Abstract:
Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec-3d.github.io/
中文: MPEC提出了一种掩蔽点实体对比学习方法,通过三维实体-语言对齐和跨视角点云一致性实现了开放词汇三维语义分割的突破,在ScanNet上达到最优性能并展现出卓越的零样本场景理解能力。
English: MPEC introduces a masked point-entity contrastive learning method for open-vocabulary 3D semantic segmentation, achieving state-of-the-art results on ScanNet and demonstrating strong zero-shot scene understanding capabilities across diverse tasks.

Authors:Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gong, Yuran Wang, Yuxuan Kuang, Ruihai Wu, Baoxiong Jia, Carlo Sferrazza, Hao Dong, Siyuan Huang, Yue Wang, Jitendra Malik, Pieter Abbeel
Title: RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
Abstract:
Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing evaluation protocols. Collecting real-world data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation offer promising alternatives, yet existing efforts often fall short in data quality, diversity, and benchmark standardization. To address these challenges, we introduce RoboVerse, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks. Our simulation platform supports multiple simulators and robotic embodiments, enabling seamless transitions between different environments. The synthetic dataset, featuring high-fidelity physics and photorealistic rendering, is constructed through multiple approaches. Additionally, we propose unified benchmarks for imitation learning and reinforcement learning, enabling evaluation across different levels of generalization. At the core of the simulation platform is MetaSim, an infrastructure that abstracts diverse simulation environments into a universal interface. It restructures existing simulation environments into a simulator-agnostic configuration system, as well as an API aligning different simulator functionalities, such as launching simulation environments, loading assets with initial states, stepping the physics engine, etc. This abstraction ensures interoperability and extensibility. Comprehensive experiments demonstrate that RoboVerse enhances the performance of imitation learning, reinforcement learning, world model learning, and sim-to-real transfer. These results validate the reliability of our dataset and benchmarks, establishing RoboVerse as a robust solution for advancing robot learning.
中文: RoboVerse提出一个包含仿真平台、合成数据集和统一基准的综合框架,以解决机器人技术的数据扩展和评估难题,有效提升了模仿学习、强化学习及仿真到现实的迁移性能。
English: RoboVerse is introduced as a comprehensive framework with a simulation platform, synthetic dataset, and unified benchmarks to address robotics' data scaling and evaluation challenges, enhancing learning methods and sim-to-real transfer.

Authors:Junyan Zhang, Shuliang Liu, Aiwei Liu, Yubo Gao, Jungang Li, Xiaojie Gu, Xuming Hu
Title: CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality
Abstract:
Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
Chinese: CoheMark是一种先进的句子级别水印技术,通过利用句子间的连贯性来提升逻辑流畅度,并借助模糊C均值聚类和特定选择标准,在保持文本质量的同时实现强大的水印检测能力。
English: CoheMark is an advanced sentence-level watermarking technique that enhances logical fluency by leveraging inter-sentence cohesion, achieving robust watermark detection with minimal impact on text quality through fuzzy c-means clustering and specific selection criteria.

Authors:Chengkai Huang, Hongtao Huang, Tong Yu, Kaige Xie, Junda Wu, Shuai Zhang, Julian Mcauley, Dietmar Jannach, Lina Yao
Title: A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based, Generative to Agentic Paradigms
Abstract:
Recommender systems (RS) have become essential in filtering information and personalizing content for users. RS techniques have traditionally relied on modeling interactions between users and items as well as the features of content using models specific to each task. The emergence of foundation models (FMs), large scale models trained on vast amounts of data such as GPT, LLaMA and CLIP, is reshaping the recommendation paradigm. This survey provides a comprehensive overview of the Foundation Models for Recommender Systems (FM4RecSys), covering their integration in three paradigms: (1) Feature-Based augmentation of representations, (2) Generative recommendation approaches, and (3) Agentic interactive systems. We first review the data foundations of RS, from traditional explicit or implicit feedback to multimodal content sources. We then introduce FMs and their capabilities for representation learning, natural language understanding, and multi-modal reasoning in RS contexts. The core of the survey discusses how FMs enhance RS under different paradigms. Afterward, we examine FM applications in various recommendation tasks. Through an analysis of recent research, we highlight key opportunities that have been realized as well as challenges encountered. Finally, we outline open research directions and technical challenges for next-generation FM4RecSys. This survey not only reviews the state-of-the-art methods but also provides a critical analysis of the trade-offs among the feature-based, the generative, and the agentic paradigms, outlining key open issues and future research directions.
中文: 推荐系统正通过GPT和LLaMA等基础模型进行革新,该综述全面探讨了其在特征增强、生成式推荐和智能交互范式中的应用、机遇与挑战。
English: Recommender systems are evolving with foundation models like GPT and LLaMA, which enhance them through feature augmentation, generative approaches, and agentic interactions, as surveyed in this comprehensive overview of opportunities and challenges.

Authors:Phuong Quynh Le, Christin Seifert, Jörg Schlötterer
Title: Invariant Learning with Annotation-free Environments
Abstract:
Invariant learning is a promising approach to improve domain generalization compared to Empirical Risk Minimization (ERM). However, most invariant learning methods rely on the assumption that training examples are pre-partitioned into different known environments. We instead infer environments without the need for additional annotations, motivated by observations of the properties within the representation space of a trained ERM model. We show the preliminary effectiveness of our approach on the ColoredMNIST benchmark, achieving performance comparable to methods requiring explicit environment labels and on par with an annotation-free method that poses strong restrictions on the ERM reference model.
中文摘要:不变学习通过从训练好的ERM模型的表征空间中推断环境,无需预标注数据即可提升领域泛化能力,在ColoredMNIST基准测试中取得了与依赖环境标签方法相媲美的性能。
English Summary: Invariant learning enhances domain generalization over ERM by inferring environments from a trained model's representation space, eliminating the need for pre-labeled data, as demonstrated by competitive results on ColoredMNIST.

Authors:Phuong Quynh Le, Jörg Schlötterer, Christin Seifert
Title: An XAI-based Analysis of Shortcut Learning in Neural Networks
Abstract:
Machine learning models tend to learn spurious features - features that strongly correlate with target labels but are not causal. Existing approaches to mitigate models' dependence on spurious features work in some cases, but fail in others. In this paper, we systematically analyze how and where neural networks encode spurious correlations. We introduce the neuron spurious score, an XAI-based diagnostic measure to quantify a neuron's dependence on spurious features. We analyze both convolutional neural networks (CNNs) and vision transformers (ViTs) using architecture-specific methods. Our results show that spurious features are partially disentangled, but the degree of disentanglement varies across model architectures. Furthermore, we find that the assumptions behind existing mitigation methods are incomplete. Our results lay the groundwork for the development of novel methods to mitigate spurious correlations and make AI models safer to use in practice.
中文摘要:本文提出神经元伪相关评分来系统分析神经网络如何编码伪相关特征,揭示了不同架构中特征解耦程度的差异,并指出现有缓解方法的局限性。
English Summary: This paper introduces a neuron spurious score to systematically analyze how neural networks encode spurious correlations, revealing varying degrees of feature disentanglement across architectures and highlighting limitations in current mitigation methods.

Authors:Jiaxin GUO, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Zongyao Li, Hengchao Shang, Daimeng Wei, Hao Yang
Title: Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
Abstract:
With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, model-based metrics, and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to the future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.
中文摘要:本文系统分析了文档级机器翻译自动评估方法的现状与挑战,指出当前评估存在参考译文多样性不足、LLM评判方法存在偏差等问题,并展望了减少对句子级信息依赖、开发专用评估模型等未来研究方向。
English Summary: This paper analyzes the current state and challenges of automatic evaluation methods for document-level machine translation, highlighting issues like limited reference diversity and the shortcomings of LLM-based assessment, while proposing future research directions to enhance evaluation accuracy and robustness.

Authors:Hongwei Ji, Wulian Yun, Mengshi Qi, Huadong Ma
Title: Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization
Abstract:
Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method based on Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more of the variance in actions than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We also introduce the first dataset, named Human-related Anomaly Localization, and explore the application of the TAL task in human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.
中文: 本文提出了一种新颖的少样本时序动作定位方法,通过结合思维链文本推理和语义感知的文本-视觉对齐机制,在多个数据集上显著超越了现有方法的性能表现。
English: This paper introduces a novel few-shot temporal action localization method that enhances performance by integrating Chain-of-Thought textual reasoning and semantic-aware text-visual alignment, significantly outperforming existing approaches across multiple datasets.

Authors:Changsheng Lv, Mengshi Qi, Zijian Fu, Huadong Ma
Title: Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation
Abstract:
In this paper, we introduce a novel method named Robo-SGG, i.e., Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation. Compared to the existing SGG setting, robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to compromised visual features, e.g., corruption interference or occlusions. To obtain robust visual features, we exploit the layout information, which is domain-invariant, to enhance the efficacy of existing SGG methods on corrupted images. Specifically, we employ Instance Normalization (IN) to filter out the domain-specific features and recover the unchangeable structural features, i.e., the positional and semantic relationships among objects, via the proposed Layout-Oriented Restitution. Additionally, we propose a Layout-Embedded Encoder (LEE) that augments the existing object and predicate encoders within the SGG framework, enriching the robust positional and semantic features of objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C dataset, respectively, and achieve new state-of-the-art performance on the corrupted scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
中文: 本文提出Robo-SGG模块,通过布局导向的归一化与还原机制,利用领域不变的布局信息增强场景图生成模型在受损图像上的鲁棒性,可即插即用地提升现有方法的性能。
English: This paper introduces Robo-SGG, a plug-and-play module that uses layout-oriented normalization and restitution to enhance scene graph generation robustness against corrupted images by leveraging domain-invariant structural features.

Authors:Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Xiangyang Xue, Yi Zhu
Title: CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
Abstract:
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility. Our dataset, code and pre-trained models are available on the project page.
中文: 本文提出了CAP-Net单阶段网络,通过融合RGB-D特征实现端到端的关节部件6D姿态与尺寸估计,并发布了RGBD-Art数据集以弥合仿真与现实的差距,实验证明其性能显著优于现有最优方法。
English: This paper introduces CAP-Net, a single-stage network that combines RGB-D features for end-to-end 6D pose and size estimation of articulated object parts, and presents the RGBD-Art dataset to bridge the sim-to-real gap, demonstrating superior performance over existing methods.

Authors:Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian
Title: CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates
Abstract:
The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
中文摘要:CleanMAP是一个基于多模态大语言模型的蒸馏框架,通过评估车道可见性并融合高质量局部地图来优化众包高清地图更新,在自动驾驶导航中实现了卓越的精度和与人工评估的高度一致性。
English Summary: CleanMAP is a multimodal large language model-based framework that refines crowdsourced data for high-definition map updates by scoring lane visibility and fusing top-quality local maps, achieving superior accuracy and human alignment in autonomous navigation.
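
The fusion step described in the abstract (rank local maps by confidence, keep the top-k within a range below the best score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes "best score minus 10%" means a relative margin of 10% of the best score, and the names `select_local_maps`, `k`, and `margin` are hypothetical.

```python
def select_local_maps(scored_maps, k=3, margin=0.10):
    """Pick up to k local maps whose confidence is close to the best one.

    scored_maps: list of (map_id, confidence) pairs, confidence in [0, 10].
    margin: assumed relative tolerance below the best score (10% here).
    """
    if not scored_maps:
        return []
    best = max(score for _, score in scored_maps)
    threshold = best * (1.0 - margin)
    # Keep only maps inside the confidence range, then rank them.
    eligible = [(m, s) for m, s in scored_maps if s >= threshold]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:k]

# Hypothetical scores for five crowdsourced local maps.
candidates = [("a", 9.0), ("b", 8.5), ("c", 7.9), ("d", 8.2), ("e", 6.0)]
selected = select_local_maps(candidates)
print(selected)
```

With `k=3`, this matches the setting in which the abstract reports the lowest mean update error. The subsequent fusion of the selected maps is not shown.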

Authors:Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang
Title: DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify
Abstract:
Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.
中文: DataMosaic是一个多智能体框架,通过从原始数据中动态提取任务相关结构(如表格、图、树),提供透明的逐步推理轨迹并支持对中间结果的验证,使基于大语言模型的多模态数据分析兼具可解释性与可验证性。
English: DataMosaic is a multi-agent framework that makes LLM-powered multi-modal data analytics explainable and verifiable by dynamically extracting task-specific structures (e.g., tables, graphs, trees) from raw data, providing step-by-step reasoning traces and enabling validation of intermediate results.

Authors:Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang
Title: DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis
Abstract:
Large language models (LLMs) are increasingly applied to multi-modal data analysis -- not necessarily because they offer the most precise answers, but because they provide fluent, flexible interfaces for interpreting complex inputs. Yet this fluency often conceals a deeper structural failure: the prevailing ``Prompt-to-Answer'' paradigm treats LLMs as black-box analysts, collapsing evidence, reasoning, and conclusions into a single, opaque response. The result is brittle, unverifiable, and frequently misleading. We argue for a fundamental shift: from generation to structured extraction, from monolithic prompts to modular, agent-based workflows. LLMs should not serve as oracles, but as collaborators -- specialized in tasks like extraction, translation, and linkage -- embedded within transparent workflows that enable step-by-step reasoning and verification. We propose DataPuzzle, a conceptual multi-agent framework that decomposes complex questions, structures information into interpretable forms (e.g. tables, graphs), and coordinates agent roles to support transparent and verifiable analysis. This framework serves as an aspirational blueprint for restoring visibility and control in LLM-driven analytics -- transforming opaque answers into traceable processes, and brittle fluency into accountable insight. This is not a marginal refinement; it is a call to reimagine how we build trustworthy, auditable analytic systems in the era of large language models. Structure is not a constraint -- it is the path to clarity.
中文: 大型语言模型应从模糊的整体响应转向结构化、多智能体的工作流程,如DataPuzzle框架,通过分解问题与协调角色实现可验证的透明分析,构建可靠洞察。
English: Large language models (LLMs) should shift from opaque, monolithic responses to structured, multi-agent workflows like DataPuzzle, enabling transparent reasoning and verifiable analysis for trustworthy insights.

Authors:Chen Yan, Boyu Diao, Hangda Liu, Zhulin An, Yongjun Xu
Title: A Nonlinear Hash-based Optimization Method for SpMV on GPUs
Abstract:
Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of the matrices often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 against the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
中文: 本文提出基于哈希的分区(HBP)格式,利用哈希技术优化稀疏矩阵重排序,在预处理和稀疏矩阵向量乘法性能上均实现显著加速。
English: This paper introduces a Hash-based Partition (HBP) format that optimizes sparse matrix reordering using hash techniques, achieving significant speedups in both preprocessing and SpMV performance on tested hardware.
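
For context, the CSR baseline that the abstract compares against can be written in a few lines. The sketch below is a plain sequential reference for CSR SpMV, not the HBP format or a GPU kernel; the function and variable names are illustrative.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

The inner loop length equals the number of nonzeros in each row, so rows of very different lengths give uneven work per row. Reordering and partitioning schemes such as the one the abstract describes aim to group rows so that this work is more evenly distributed across parallel threads.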

Authors:Hongcheng Guo, Fei Zhao, Shaosheng Cao, Xinze Lyu, Ziyan Liu, Yue Wang, Boyang Wang, Zhoujun Li, Chonggang Lu, Zhe Xu, Yao Hu
Title: Redefining Machine Translation on Social Network Services with Large Language Models
Abstract:
The globalization of social interactions has heightened the need for machine translation (MT) on Social Network Services (SNS), yet traditional models struggle with culturally nuanced content like memes, slang, and pop culture references. While large language models (LLMs) have advanced general-purpose translation, their performance on SNS-specific content remains limited due to insufficient specialized training data and evaluation benchmarks. This paper introduces RedTrans, a 72B LLM tailored for SNS translation, trained on a novel dataset developed through three innovations: (1) Supervised Finetuning with Dual-LLM Back-Translation Sampling, an unsupervised sampling method using LLM-based back-translation to select diverse data for large-scale finetuning; (2) Rewritten Preference Optimization (RePO), an algorithm that identifies and corrects erroneous preference pairs through expert annotation, building reliable preference corpora; and (3) RedTrans-Bench, the first benchmark for SNS translation, evaluating phenomena like humor localization, emoji semantics, and meme adaptation. Experiments show RedTrans outperforms state-of-the-art LLMs. In addition, RedTrans has already been deployed in a real-world production environment, demonstrating that domain-specific adaptation effectively bridges the gap between generic and culturally grounded translation systems.
中文: 本文提出专用于社交网络翻译的72B大模型RedTrans,通过创新训练方法和专用基准测试解决了文化敏感内容的翻译难题,展现出卓越性能并已投入实际应用。
English: This paper introduces RedTrans, a specialized 72B LLM for SNS translation, which overcomes limitations in handling culturally nuanced content through innovative training methods and a dedicated benchmark, demonstrating superior performance and real-world deployment.

Authors:Hongcheng Guo, Juntao Yao, Boyang Wang, Junjia Du, Shaosheng Cao, Donglin Di, Shun Zhang, Zhoujun Li
Title: Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
Abstract:
Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
中文:混合专家架构能高效扩展大语言模型,但因参数冗余面临部署挑战;提出的聚类驱动专家剪枝方法通过分层聚类并全局消除相似专家,有效压缩模型规模。
English: Mixture-of-Experts architectures enable efficient scaling of large language models but face deployment challenges due to parameter redundancy, which the proposed Cluster-driven Expert Pruning method addresses by grouping and eliminating similar experts across layers to compress models effectively.
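
The first stage described in the abstract, grouping functionally similar experts within a layer by parameter similarity, might look roughly like the sketch below. The abstract does not specify the clustering algorithm or the similarity metric, so greedy clustering on cosine similarity is used here purely as a stand-in, and the threshold value and toy parameter vectors are assumptions. The second stage (global cluster pruning with importance scores) is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flattened parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster_experts(experts, threshold=0.95):
    """Greedily group experts whose parameters resemble a cluster's first member.

    Returns a list of clusters, each a list of expert indices.
    """
    clusters = []
    for idx, params in enumerate(experts):
        for cluster in clusters:
            if cosine(params, experts[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([idx])
    return clusters

# Hypothetical layer with four experts, flattened to 2-D for illustration.
layer = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
clusters = cluster_experts(layer)
print(clusters)
```

In this toy layer the four experts collapse into two clusters, which is the kind of redundancy a later pruning stage could exploit.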

Authors:Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel
Title: Perception in Reflection
Abstract:
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected, yet often fail, to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models to enable iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.
中文摘要:该研究提出了一种反射感知(RePer)框架,通过双模型机制迭代优化大型视觉语言模型的视觉感知能力,显著提升了图像理解、描述精度并减少幻觉现象,同时实现了模型注意力与人类视觉焦点的有效对齐。
English Summary: The study introduces a Reflective Perception (RePer) framework that uses a dual-model mechanism to iteratively enhance visual perception in large vision-language models, significantly improving image understanding, captioning accuracy, and reducing hallucinations while aligning model attention with human focus.

Authors:Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin
Title: GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Abstract:
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions covering specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.
中文: 本文提出了GenDoP模型,基于DataDoP数据集训练的自回归Transformer,能根据文本引导生成富有艺术表现力的相机轨迹,在可控性和运动稳定性上优于现有方法。
English: This paper introduces GenDoP, an auto-regressive Transformer model trained on the DataDoP dataset to generate artistic and text-aligned camera trajectories, offering superior controllability and motion stability over existing methods.

Authors:Yin Wu, Zhengxuan Zhang, Fuling Wang, Yuyu Luo, Hui Xiong, Nan Tang
Title: EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval
Abstract:
Misinformation continues to pose a significant challenge in today's information ecosystem, profoundly shaping public perception and behavior. Among its various manifestations, Out-of-Context (OOC) misinformation is particularly obscure, as it distorts meaning by pairing authentic images with misleading textual narratives. Existing methods for detecting OOC misinformation predominantly rely on coarse-grained similarity metrics between image-text pairs, which often fail to capture subtle inconsistencies or provide meaningful explainability. While multi-modal large language models (MLLMs) demonstrate remarkable capabilities in visual reasoning and explanation generation, they have not yet demonstrated the capacity to address complex, fine-grained, and cross-modal distinctions necessary for robust OOC detection. To overcome these limitations, we introduce EXCLAIM, a retrieval-based framework designed to leverage external knowledge through a multi-granularity index of multi-modal events and entities. Our approach integrates multi-granularity contextual analysis with a multi-agent reasoning architecture to systematically evaluate the consistency and integrity of multi-modal news content. Comprehensive experiments validate the effectiveness and resilience of EXCLAIM, demonstrating its ability to detect OOC misinformation with 4.3% higher accuracy compared to state-of-the-art approaches, while offering explainable and actionable insights.
中文摘要:EXCLAIM框架通过多粒度上下文分析和多智能体推理系统,有效提升了上下文误导信息的检测能力,其准确率比现有最优方法提高4.3%,并能提供可解释的分析结果。
English Summary: The EXCLAIM framework enhances out-of-context misinformation detection by integrating multi-granularity contextual analysis with multi-agent reasoning, achieving 4.3% higher accuracy than existing methods while providing explainable insights.

Authors:Mrityunjoy Gain, Kitae Kim, Avi Deb Raha, Apurba Adhikary, Eui-Nam Huh, Zhu Han, Choong Seon Hong
Title: FedFeat+: A Robust Federated Learning Framework Through Federated Aggregation and Differentially Private Feature-Based Classifier Retraining
Abstract:
In this paper, we propose the FedFeat+ framework, which distinctively separates feature extraction from classification. We develop a two-tiered model training process: following local training, clients transmit their weights and some features produced by the feature extractor during the final local epochs to the server. The server aggregates these models using the FedAvg method and subsequently retrains the global classifier utilizing the shared features. The classifier retraining process enhances the model's understanding of the holistic view of the data distribution, ensuring better generalization across diverse datasets. This improved generalization enables the classifier to adaptively influence the feature extractor during subsequent local training epochs. We establish a balance between enhancing model accuracy and safeguarding individual privacy through the implementation of differential privacy mechanisms. By incorporating noise into the feature vectors shared with the server, we ensure that sensitive data remains confidential. We present a comprehensive convergence analysis, along with theoretical reasoning regarding performance enhancement and privacy preservation. We validate our approach through empirical evaluations conducted on benchmark datasets, including CIFAR-10, CIFAR-100, MNIST, and FMNIST, achieving high accuracy while adhering to stringent privacy guarantees. The experimental results demonstrate that the FedFeat+ framework, despite using only a lightweight two-layer CNN classifier, outperforms the FedAvg method in both IID and non-IID scenarios, achieving accuracy improvements ranging from 3.92% to 12.34% across CIFAR-10, CIFAR-100, and Fashion-MNIST datasets.
中文:FedFeat+框架将特征提取与分类分离,通过双层训练和差分隐私机制在提升模型泛化能力和精度的同时保护数据隐私,在多个基准数据集上显著优于FedAvg方法。
English: The FedFeat+ framework separates feature extraction from classification, using a two-tiered training process with differential privacy to enhance generalization and accuracy while protecting data confidentiality, achieving significant improvements over FedAvg across multiple datasets.
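
Illustrative sketch (not the authors' code): the two server-side ingredients named in the abstract, FedAvg aggregation and noised feature sharing, might look roughly like this. The function names `fedavg` and `privatize_features`, the L2 clipping step, and the values of `clip_norm` and `sigma` are assumptions made for the example; the paper's actual differential privacy mechanism may differ.

```python
import math
import random

def fedavg(client_weights, client_sizes):
    """FedAvg: average client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def privatize_features(features, clip_norm=1.0, sigma=0.5, rng=None):
    """Clip a feature vector to L2 norm <= clip_norm, then add Gaussian noise.

    clip_norm and sigma are hypothetical settings, not values from the paper.
    """
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(x * x for x in features))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale + rng.gauss(0.0, sigma * clip_norm) for x in features]
```

In this reading, each client would call `privatize_features` on the features it uploads, and the server would call `fedavg` on the uploaded weights before retraining the classifier on the noised features.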

Authors:Yining Shi, Kun Jiang, Xin Zhao, Kangan Qian, Chuchu Xie, Tuopu Wen, Mengmeng Yang, Diange Yang
Title: POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud
Abstract:
LiDAR-based 3D object detection is a fundamental task in the field of autonomous driving. This paper explores the unique advantage of Frequency Modulated Continuous Wave (FMCW) LiDAR in autonomous perception. Given a single frame FMCW point cloud with radial velocity measurements, we expect that our object detector can detect the short-term future locations of objects using only the current frame sensor data and demonstrate a fast ability to respond to intermediate danger. To achieve this, we extend the standard object detection task to a novel task named predictive object detection (POD), which aims to predict the short-term future location and dimensions of objects based solely on current observations. Typically, a motion prediction task requires historical sensor information to process the temporal contexts of each object, while our detector's avoidance of multi-frame historical information enables a much faster response time to potential dangers. The core advantage of FMCW LiDAR lies in the radial velocity associated with every reflected point. We propose a novel POD framework, the core idea of which is to generate a virtual future point using a ray casting mechanism, create virtual two-frame point clouds with the current and virtual future frames, and encode these two-frame voxel features with a sparse 4D encoder. Subsequently, the 4D voxel features are separated by temporal indices and remapped into two Bird's Eye View (BEV) features: one decoded for standard current frame object detection and the other for future predictive object detection. Extensive experiments on our in-house dataset demonstrate the state-of-the-art standard and predictive detection performance of the proposed POD framework.
中文: 本文提出基于FMCW激光雷达的预测性目标检测框架,通过虚拟点云生成和4D特征编码预测短期目标位置,在实现快速响应的同时获得了领先的检测性能。
English: This paper introduces a predictive object detection framework using FMCW LiDAR that forecasts short-term object positions through virtual point generation and 4D feature encoding, achieving state-of-the-art performance with rapid response capabilities.
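
Illustrative sketch (not the authors' code): the abstract does not spell out the ray casting mechanism, so the snippet below shows only the simplest reading of a "virtual future point". It assumes each point keeps its measured radial velocity for a short horizon `dt` and moves along its line of sight from the sensor at the origin. The function name and the constant-velocity assumption are hypothetical.

```python
import math

def virtual_future_point(point, radial_velocity, dt):
    """Shift a LiDAR point along its ray by radial_velocity * dt.

    point: (x, y, z) in the sensor frame, sensor at the origin.
    radial_velocity: signed speed along the ray (positive = moving away).
    dt: prediction horizon in seconds (hypothetical parameter).
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return point  # the ray direction is undefined at the sensor origin
    shift = radial_velocity * dt
    return (x + shift * x / r, y + shift * y / r, z + shift * z / r)
```

Applying this to every point of the current frame would give the second, virtual frame of the two-frame point cloud that the abstract describes.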

Authors:Zhang Xi-Jia, Yue Guo, Shufei Chen, Simon Stepputtis, Matthew Gombolay, Katia Sycara, Joseph Campbell
Title: Model-Agnostic Policy Explanations with Large Language Models
Abstract:
Intelligent agents, such as robots, are increasingly deployed in real-world, human-centric environments. To foster appropriate human trust and meet legal and ethical standards, these agents must be able to explain their behavior. However, state-of-the-art agents are typically driven by black-box models like deep neural networks, limiting their interpretability. We propose a method for generating natural language explanations of agent behavior based only on observed states and actions -- without access to the agent's underlying model. Our approach learns a locally interpretable surrogate model of the agent's behavior from observations, which then guides a large language model to generate plausible explanations with minimal hallucination. Empirical results show that our method produces explanations that are more comprehensible and correct than those from baselines, as judged by both language models and human evaluators. Furthermore, we find that participants in a user study more accurately predicted the agent's future actions when given our explanations, suggesting improved understanding of agent behavior.
Chinese: 本文提出一种方法,通过观测数据和局部可解释的替代模型为智能体行为生成自然语言解释,无需访问其内部模型,从而提高了可理解性和用户对行为的预测准确性。
English: This paper introduces a method that generates natural language explanations for intelligent agents' behavior using observed data and a locally interpretable surrogate model, enhancing comprehensibility and user understanding without accessing the agent's internal model.

Authors:Yichen Dong, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, Hao Yang
Title: Two Intermediate Translations Are Better Than One: Fine-tuning LLMs for Document-level Translation Refinement
Abstract:
Recent research has shown that large language models (LLMs) can enhance translation quality through self-refinement. In this paper, we build on this idea by extending the refinement from sentence-level to document-level translation, specifically focusing on document-to-document (Doc2Doc) translation refinement. Since sentence-to-sentence (Sent2Sent) and Doc2Doc translation address different aspects of the translation process, we propose fine-tuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc. Additionally, recognizing that the quality of intermediate translations varies, we introduce an enhanced fine-tuning method with quality awareness that assigns lower weights to easier translations and higher weights to more difficult ones, enabling the model to focus on challenging translation cases. Experimental results across ten translation tasks with LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct demonstrate the effectiveness of our approach.
中文: 本研究通过采用质量感知的微调方法,优先处理困难翻译案例,提升了大型语言模型在文档级翻译中的表现,并在多项任务中验证了其有效性。
English: This study advances document-level translation by fine-tuning large language models with quality-aware methods that prioritize challenging cases, demonstrating improved performance across multiple tasks.
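
Illustrative sketch (not the authors' code): the abstract says the quality-aware method gives lower weights to easier translations and higher weights to harder ones, without stating the weighting function. The snippet assumes a quality score in [0, 1] per training instance and uses `1 - quality` as the weight. Both the score range and that formula are assumptions made for the example.

```python
def quality_aware_loss(losses, quality_scores):
    """Weighted mean of per-instance losses, weight = 1 - quality.

    losses: per-instance fine-tuning losses.
    quality_scores: hypothetical quality of the intermediate translation,
        in [0, 1]; high quality means an easy case and a low weight.
    """
    weights = [1.0 - q for q in quality_scores]
    total = sum(weights)
    if total == 0.0:
        # every instance is rated perfect: fall back to the plain mean
        return sum(losses) / len(losses)
    return sum(w * l for w, l in zip(weights, losses)) / total
```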

Authors:Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, Xianzhi Du
Title: Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
Abstract:
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, vanilla SMoEs have issues such as expert redundancy and heavy memory requirements, making them inefficient and non-scalable, especially for resource-constrained scenarios. Expert-level sparsification of SMoEs involves pruning the least important experts to address these limitations. In this work, we aim to address three questions: (1) What is the best recipe to identify the least knowledgeable subset of experts that can be dropped with minimal impact on performance? (2) How should we perform expert dropping (one-shot or iterative), and what correction measures can we undertake to minimize its drastic impact on SMoE subnetwork capabilities? (3) What capabilities of full-SMoEs are severely impacted by the removal of the least dominant experts, and how can we recover them? Firstly, we propose MoE Experts Compression Suite (MC-Suite), which is a collection of some previously explored and multiple novel recipes to provide a comprehensive benchmark for estimating expert importance from diverse perspectives, as well as unveil numerous valuable insights for SMoE experts. Secondly, unlike prior works with a one-shot expert pruning approach, we explore the benefits of iterative pruning with the re-estimation of the MC-Suite criterion. Moreover, we introduce the benefits of task-agnostic fine-tuning as a correction mechanism during iterative expert dropping, which we term MoE Lottery Subnetworks. Lastly, we present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt, which can be restored to a robust level subject to external augmentation of instruction-following capabilities using k-shot examples and supervised fine-tuning.
中文: 本研究提出MC-Suite套件来评估专家重要性,采用迭代剪枝与任务无关微调以减少稀疏激活专家混合模型的性能损失,并发现指令跟随能力是最易受损环节,需通过示例增强和监督微调进行针对性恢复。
English: This study introduces MC-Suite to benchmark expert importance evaluation methods and proposes iterative pruning with task-agnostic fine-tuning to mitigate performance loss in sparsely activated Mixture-of-Experts models, while identifying instruction-following capability as the most vulnerable aspect requiring targeted recovery.
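
Illustrative sketch (not the authors' code): the difference between one-shot and iterative expert dropping is that the importance criterion is re-estimated after every round. In the snippet below, `importance_fn` stands in for any MC-Suite criterion. Its signature, the `per_round` parameter, and the omission of the fine-tuning step between rounds are simplifications assumed for the example.

```python
def iterative_expert_drop(experts, importance_fn, n_drop, per_round=1):
    """Drop n_drop experts over several rounds, re-scoring before each round.

    experts: list of expert ids.
    importance_fn: callable(list_of_kept_ids) -> {id: score}; hypothetical
        stand-in for an expert-importance criterion.
    """
    kept = list(experts)
    dropped = 0
    while dropped < n_drop and len(kept) > 1:
        scores = importance_fn(kept)  # re-estimate on the current subnetwork
        k = min(per_round, n_drop - dropped, len(kept) - 1)
        to_remove = set(sorted(kept, key=lambda e: scores[e])[:k])
        kept = [e for e in kept if e not in to_remove]
        dropped += k
        # A correction step such as task-agnostic fine-tuning would go here.
    return kept
```

Setting `per_round=n_drop` reduces this to one-shot pruning, which makes the two regimes easy to compare.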

Authors:Yichen Li, Qiyu Qin, Gaoyang Zhu, Wenchao Xu, Haozhao Wang, Yuhua Li, Rui Zhang, Ruixuan Li
Title: A Systematic Survey on Federated Sequential Recommendation
Abstract:
Sequential recommendation is an advanced recommendation technique that utilizes the sequence of user behaviors to generate personalized suggestions by modeling the temporal dependencies and patterns in user preferences. However, it requires a server to centrally collect users' data, which poses a threat to the data privacy of different users. In recent years, federated learning has emerged as a distributed architecture that allows participants to train a global model while keeping their private data locally. This survey pioneers Federated Sequential Recommendation (FedSR), where each user joins as a participant in federated training to achieve a recommendation service that balances data privacy and model performance. We begin with an introduction to the background and unique challenges of FedSR. Then, we review existing solutions from two levels, each of which includes two specific techniques. Additionally, we discuss the critical challenges and future research directions in FedSR.
中文摘要:本调查开创性地提出联邦顺序推荐(FedSR),通过将顺序推荐与联邦学习相结合,让用户作为参与者进行本地训练,在保护数据隐私的同时实现推荐服务性能的平衡。
English Summary: This survey introduces Federated Sequential Recommendation (FedSR), a novel approach that combines sequential recommendation with federated learning to protect user privacy while maintaining model performance by training locally without centralizing data.

Authors:Avi Deb Raha, Kitae Kim, Mrityunjoy Gain, Apurba Adhikary, Zhu Han, Eui-Nam Huh, Choong Seon Hong
Title: Security Risks in Vision-Based Beam Prediction: From Spatial Proxy Attacks to Feature Refinement
Abstract:
The rapid evolution towards the sixth-generation (6G) networks demands advanced beamforming techniques to address challenges in dynamic, high-mobility scenarios, such as vehicular communications. Vision-based beam prediction utilizing RGB camera images emerges as a promising solution for accurate and responsive beam selection. However, reliance on visual data introduces unique vulnerabilities, particularly susceptibility to adversarial attacks, thus potentially compromising beam accuracy and overall network reliability. In this paper, we conduct the first systematic exploration of adversarial threats specifically targeting vision-based mmWave beam selection systems. Traditional white-box attacks are impractical in this context because ground-truth beam indices are inaccessible and spatial dynamics are complex. To address this, we propose a novel black-box adversarial attack strategy, termed Spatial Proxy Attack (SPA), which leverages spatial correlations between user positions and beam indices to craft effective perturbations without requiring access to model parameters or labels. To counteract these adversarial vulnerabilities, we formulate an optimization framework aimed at simultaneously enhancing beam selection accuracy under clean conditions and robustness against adversarial perturbations. We introduce a hybrid deep learning architecture integrated with a dedicated Feature Refinement Module (FRM), designed to systematically filter irrelevant, noisy and adversarially perturbed visual features. Evaluations using standard backbone models such as ResNet-50 and MobileNetV2 demonstrate that our proposed method significantly improves performance, achieving up to a 21.07% gain in Top-K accuracy under clean conditions and a 41.31% increase in Top-1 adversarial robustness compared to different baseline models.
中文: 本文针对6G网络中基于视觉的波束选择系统,提出了一种黑盒对抗攻击策略,并设计了一种集成特征优化模块的混合深度学习架构,以同时提升系统在正常和受攻击条件下的性能与鲁棒性。
English: This paper introduces a black-box adversarial attack strategy for vision-based beam selection in 6G networks and proposes a hybrid deep learning architecture with a feature refinement module to enhance both accuracy and robustness against such threats.

Authors:Xinglin Lyu, Wei Tang, Yuang Li, Xiaofeng Zhao, Ming Zhu, Junhui Li, Yunfei Lu, Min Zhang, Daimeng Wei, Hao Yang, Min Zhang
Title: DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation
Abstract:
Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.
中文:DoCIA框架通过在多阶段引入基于大语言模型的文档级上下文,显著提升了语音翻译在句子和篇章层面的表现,同时有效控制了计算开销。
English: The DoCIA framework enhances speech translation by integrating document-level context through LLM-based modules across multiple stages, significantly outperforming traditional methods in both sentence and discourse metrics while minimizing computational costs.

Authors:Jiabao Guo, Ajian Liu, Yunfeng Diao, Jin Zhang, Hui Ma, Bo Zhao, Richang Hong, Meng Wang
Title: Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering
Abstract:
The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) The facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompt explicitly benefits from abundant transferred knowledge from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.
中文摘要:本文提出了一种内容感知复合提示工程方法,通过生成结合大语言模型知识和自适应视觉内容提取的实例级提示,解决了人脸活体检测中领域泛化的局限性,并取得了最先进的性能。
English Summary: This paper introduces a Content-aware Composite Prompt Engineering method to overcome limitations in domain generalization for face anti-spoofing by generating instance-wise prompts that leverage both large language model knowledge and adaptive visual content extraction, achieving state-of-the-art performance.

Authors:Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang, Haizhou Li
Title: Causal Self-supervised Pretrained Frontend with Predictive Code for Speech Separation
Abstract:
Speech separation (SS) seeks to disentangle a multi-talker speech mixture into single-talker speech streams. Although SS can be generally achieved using offline methods, such a processing paradigm is not suitable for real-time streaming applications. Causal separation models, which rely only on past and present information, offer a promising solution for real-time streaming. However, these models typically suffer from notable performance degradation due to the absence of future context. In this paper, we introduce a novel frontend that is designed to mitigate the mismatch between training and run-time inference by implicitly incorporating future information into causal models through predictive patterns. The pretrained frontend employs a transformer decoder network with a causal convolutional encoder as the backbone and is pretrained in a self-supervised manner with two innovative pretext tasks: autoregressive hybrid prediction and contextual knowledge distillation. These tasks enable the model to capture predictive patterns directly from mixtures in a self-supervised manner. The pretrained frontend subsequently serves as a feature extractor to generate high-quality predictive patterns. Comprehensive evaluations on synthetic and real-world datasets validated the effectiveness of the proposed pretrained frontend.
Chinese: 本文提出一种自监督前端,通过预测模式为因果语音分离模型隐式融入未来信息,有效弥补实时流处理中上下文缺失,显著提升分离性能。
English: This paper introduces a self-supervised frontend that enhances causal speech separation models by incorporating predictive patterns to compensate for the absence of future context, thereby improving real-time streaming performance.

Authors:Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li
Title: T*: Re-thinking Temporal Search for Long-Form Video Understanding
Abstract:
Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold: First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on specific queries. Upon this formulation, we introduce LV-Haystack, the first dataset with 480 hours of videos, 15,092 human-annotated instances for both training and evaluation aiming to improve temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods only achieving 2.1% temporal F1 score on the Longvideobench subset. Next, inspired by visual search in images, we propose a lightweight temporal search framework, T* that reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-OV-72B's performance from 56.5% to 62.4% on the Longvideobench XL subset. Our code, benchmark, and models are provided in the Supplementary material.
Chinese: 本研究通过引入LV-Haystack数据集并提出将时序搜索重构为空间搜索的轻量级框架T*,解决了长视频时序搜索的挑战,显著提升了现有最佳方法的性能。
English: This work addresses the challenge of temporal search in long-form video understanding by introducing the LV-Haystack dataset and proposing T*, a lightweight framework that reframes temporal search as spatial search, significantly improving state-of-the-art performance.

Authors:Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong
Title: More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Abstract:
Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data, with its low cost and high quality, enables effective alignment through single- or multi-model generated preference data. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
中文: 直接偏好优化(DPO)利用合成偏好数据有效对齐大语言模型与人类价值观,但多模型生成数据在提升通用任务性能的同时,会因线性可分性导致奖励攻击漏洞,显著增加模型遭遇越狱攻击的安全风险。
English: Direct Preference Optimization (DPO) effectively aligns large language models with human values using synthetic preference data, but multi-model generated data, while boosting general task performance, increases safety risks by enabling reward hacking and higher attack success rates against jailbreaking prompts.
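
Illustrative sketch: for readers unfamiliar with the objective under study, the standard per-pair DPO loss is the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins on the chosen and rejected responses. The function name, argument names, and the default `beta` are chosen for this example and are not taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls whenever the policy raises the chosen response's likelihood relative to the rejected one. That is why highly separable chosen and rejected pairs, as the abstract reports for multi-model data, can be fit through superficial cues.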

Authors:Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xinbing Liang, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu
Title: Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Abstract:
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This book provides a comprehensive overview, framing intelligent agents within modular, brain-inspired architectures that integrate principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we investigate the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities and elucidating core components such as memory, world modeling, reward processing, goal, and emotion. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms. Third, we examine multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures. Finally, we address the critical imperative of building safe and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment. By synthesizing modular AI architectures with insights from different disciplines, this survey identifies key research challenges and opportunities, encouraging innovations that harmonize technological advancement with meaningful societal benefit.
中文: 本书提出基于脑科学启发的模块化架构来构建智能代理,系统阐述其认知基础、自我增强机制、多智能体系统和安全框架,旨在推动人工智能技术与社会效益的协同发展。
English: This book provides a comprehensive framework for developing intelligent agents through brain-inspired modular architectures, covering cognitive foundations, self-improvement mechanisms, multi-agent systems, and safety considerations to advance AI research and applications.

Authors:Kegang Wang, Jiankai Tang, Yuxuan Fan, Jiatong Ji, Yuanchun Shi, Yuntao Wang
Title: Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality
Abstract:
Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency -- surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction gains in real-world deployments. The code and demos are released for reproducibility on https://health-hci-group.github.io/ME-rPPG-demo/.
中文: 本文提出ME-rPPG算法,通过时空状态空间对偶性解决了远程光电容积描记技术中的计算瓶颈,在模型可扩展性、跨数据集泛化能力和实时性之间取得平衡,在跨数据集测试和实际部署中实现了更高的精度与效率。
English: This paper introduces ME-rPPG, a memory-efficient algorithm that overcomes computational bottlenecks in remote photoplethysmography by balancing model scalability, generalization, and real-time performance through temporal-spatial state space duality, achieving superior accuracy and efficiency in cross-dataset evaluations and real-world deployments.

Authors:Mrityunjoy Gain, Kitae Kim, Avi Deb Raha, Apurba Adhikary, Walid Saad, Zhu Han, Choong Seon Hong
Title: AI-Driven Framework for Multi-Service Multi-Modal Devices in NextG ORAN Systems
Abstract:
In this paper, an artificial intelligence (AI)-driven efficient RAN management framework is proposed. This framework introduces the concept of the multi-service-modal UE (MSMU) system, which allows a single UE to handle both eMBB and uRLLC services. The proposed framework integrates traffic demand prediction, route optimization, RAN slicing, service identification, and radio resource management under uncertainty. The challenge of dynamic environments in such a system is addressed by decomposing the optimization problem into long-term (L-SP) and short-term (S-SP) subproblems. Using a long short-term memory (LSTM) model, the proposed approach allows the prediction of eMBB and uRLLC traffic demands and optimal routes for RAN slicing in the L-SP. For the S-SP, another LSTM model is employed to handle real-time service type identification and resource management based on long-term predictions. To support continuous adaptation, continual learning is incorporated into the S-SP framework, allowing the model to learn new service types while retaining prior knowledge. Experimental results show that the proposed framework efficiently manages dual-mode UEs, achieving low mean square error for traffic demand (0.003), resource block prediction (0.003), and power prediction (0.002), with 99% accuracy in service type and route selection and over 95% average accuracy for continual service adaptation across seven tasks.
中文: 本文提出了一种人工智能驱动的高效无线接入网管理框架,引入多业务模式用户设备(MSMU)系统,利用LSTM模型完成长期流量需求与路由预测以及短期业务类型识别与资源管理,并结合持续学习以适应新的业务类型。
English: This paper proposes an AI-driven RAN management framework built around a multi-service-modal UE (MSMU) system that handles both eMBB and uRLLC services, using LSTM models for long-term traffic demand and route prediction and for short-term service identification and resource management, with continual learning to adapt to new service types.

Authors:Mrityunjoy Gain, Kitae Kim, Avi Deb Raha, Apurba Adhikary, Walid Saad, Zhu Han, Choong Seon Hong
Title: A Deep Incremental Framework for Multi-Service Multi-Modal Devices in NextG AI-RAN Systems
Abstract:
In this paper, we propose a deep incremental framework for efficient RAN management, introducing the Multi-Service-Modal UE (MSMU) system, which enables a single UE to handle eMBB and uRLLC services simultaneously. We formulate an optimization problem integrating traffic demand prediction, route optimization, RAN slicing, service identification, and radio resource management under uncertainty. We decompose it into long-term (L-SP) and short-term (S-SP) subproblems, then propose a Transformer model for L-SP optimization, predicting eMBB and uRLLC traffic demands and optimizing routes for RAN slicing. To address non-stationary network traffic with evolving trends and scale variations, we integrate reversible instance normalization (ReVIN) into the forecasting pipeline. For the S-SP, we propose an LSTM model enabling real-time service type identification and resource management, utilizing L-SP predictions. We incorporate continual learning into the S-SP framework to adapt to new service types while preserving prior knowledge. Experimental results demonstrate that our proposed framework achieves up to a 46.86% reduction in traffic demand prediction error, 26.70% and 18.79% improvements in PRB and power estimation, 7.23% higher route selection accuracy, and a 7.29% improvement in service identification over the baselines, with 95% average accuracy in continual service identification across seven sequential tasks.
中文: 本文提出了一种用于高效无线接入网管理的深度增量框架,通过多业务模式用户设备和集成可逆实例归一化的Transformer-LSTM模型,在流量预测、资源管理和持续学习方面实现了显著性能提升。
English: This paper introduces a deep incremental framework with a Multi-Service-Modal UE system for simultaneous eMBB and uRLLC services, employing Transformer and LSTM models with reversible instance normalization for traffic prediction and resource management, achieving significant improvements in prediction accuracy and resource efficiency.
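
Illustrative sketch (not the authors' code): reversible instance normalization removes each input window's own mean and scale before forecasting and restores them on the output, which is how it copes with shifting trends and scales. The version below omits the learnable affine parameters used in the full method. The function names and `eps` are assumptions made for the example.

```python
import math

def revin_normalize(window, eps=1e-5):
    """Normalize one input window by its own mean and standard deviation."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in window], (mean, std)

def revin_denormalize(values, stats):
    """Map model outputs back to the original scale of that window."""
    mean, std = stats
    return [v * std + mean for v in values]
```

A forecaster would be applied between the two calls: normalize the history, predict in normalized space, then denormalize the prediction with the same `stats`.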

Authors:Jincheng Zhong, Xiangcheng Zhang, Jianmin Wang, Mingsheng Long
Title: Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model
Abstract:
Recent advancements in diffusion models have revolutionized generative modeling. However, the impressive and vivid outputs they produce often come at the cost of significant model scaling and increased computational demands. Consequently, building personalized diffusion models based on off-the-shelf models has emerged as an appealing alternative. In this paper, we introduce a novel perspective on conditional generation for transferring a pre-trained model. From this viewpoint, we propose *Domain Guidance*, a straightforward transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. Domain Guidance shares a formulation similar to advanced classifier-free guidance, facilitating better domain alignment and higher-quality generations. We provide both empirical and theoretical analyses of the mechanisms behind Domain Guidance. Our experimental results demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 23.4% improvement in FD$_\text{DINOv2}$ compared to standard fine-tuning. Notably, existing fine-tuned models can seamlessly integrate Domain Guidance to leverage these benefits, without additional training.
中文: 本文提出领域引导方法,通过利用预训练扩散模型在无需额外训练的情况下提升领域对齐和生成质量,相比标准微调实现了显著改进。
English: This paper introduces Domain Guidance, a novel conditional generation method that leverages pre-trained diffusion models to enhance domain alignment and generation quality without additional training, achieving significant improvements over standard fine-tuning.
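
The abstract states only that Domain Guidance shares a formulation similar to classifier-free guidance. The sketch below shows what such a rule could look like, under the assumption that the pre-trained model's prediction plays the role of the unconditional branch and the fine-tuned model's prediction plays the role of the conditional branch. The function name, the argument names, and the exact combination rule are illustrative assumptions, not the paper's definition.

```python
def domain_guided_prediction(pretrained_pred, finetuned_pred, guidance_scale):
    """Extrapolate from the pre-trained prediction toward the fine-tuned one.

    Assumed classifier-free-guidance-style rule:
        guided = pretrained + guidance_scale * (finetuned - pretrained)
    """
    return [
        p + guidance_scale * (f - p)
        for p, f in zip(pretrained_pred, finetuned_pred)
    ]

# Toy noise predictions for a three-dimensional sample (made-up values).
pretrained = [0.10, -0.30, 0.50]
finetuned = [0.20, -0.10, 0.40]
print(domain_guided_prediction(pretrained, finetuned, guidance_scale=1.5))
```

Under this assumed rule, a scale of 1 recovers the fine-tuned prediction and larger scales push sampling further toward the target domain. No retraining is needed because the rule only combines two existing predictions at sampling time.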

Authors:Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su
Title: An Illusion of Progress? Assessing the Current State of Web Agents
Abstract:
As digitalization and cloud technologies evolve, the web is becoming increasingly important in modern society. Autonomous web agents based on large language models (LLMs) hold great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.
中文: 本研究提出了Online-Mind2Web在线评估基准,通过模拟真实使用场景全面测评网络智能体,发现现有研究对其能力存在过度乐观的评估差距,并开发出与人类判断高度一致的自动评估方法,为未来发展指明方向。
English: This study introduces Online-Mind2Web, a comprehensive benchmark for evaluating web agents under realistic conditions, revealing a significant overestimation of current capabilities in prior research and proposing an automated evaluation method with high human agreement to guide future development.

Authors:Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, Jiajun Wu
Title: WorldScore: A Unified Evaluation Benchmark for World Generation
Abstract:
We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/
中文: WorldScore是首个统一的世界生成基准,通过将生成过程分解为具有明确布局规范的连续场景任务,评估了19个模型在可控性、质量和动态性方面的表现,揭示了各类模型的关键见解与挑战。
English: The WorldScore benchmark is the first unified framework for evaluating world generation models by decomposing the process into sequential scene generation tasks with explicit layout specifications, assessing 19 models across controllability, quality, and dynamics to reveal key insights and challenges.

Authors:Shide Zhou, Kailong Wang, Ling Shi, Haoyu Wang
Title: Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics
Abstract:
The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to notorious threats such as hallucinations, jailbreak attacks, and backdoor exploits. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised LLM-embedded system integrity. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a unified framework that can efficiently identify a range of security threats in real-time without imposing prohibitive computational costs. Extensive experiments indicate detection accuracies exceeding 95% and consistently robust performance across multiple models in most scenarios, while preserving the ability to detect novel attacks effectively. Furthermore, the computational overhead remains minimal, with merely fractions of a second. The significance of this work lies in proposing a promising strategy to reinforce the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.
中文: 本文提出一种通过分析层级激活模式的隐藏状态取证方法,能以超过95%准确率实时检测大语言模型异常行为,其计算开销极低,为提升AI系统安全性提供了有效解决方案。
English: This paper introduces a hidden state forensics method that detects abnormal LLM behaviors with over 95% accuracy by analyzing layer-specific activation patterns, offering real-time threat detection with minimal computational overhead to enhance AI system security.

Authors:Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, Jianfeng Feng
Title: DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Abstract:
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
中文: DecoFuse提出了一种受大脑启发的框架,通过将fMRI信号分解为语义、空间和运动三个独立成分分别解码再融合,在视频重建任务中实现了卓越性能,并通过神经编码分析验证了其与双通路假设的生物一致性。
English: DecoFuse introduces a brain-inspired framework that decomposes fMRI signals into semantic, spatial, and motion components for separate decoding and fusion, achieving superior performance in video reconstruction and validating biological plausibility through neural alignment.

Authors:Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Title: Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Abstract:
Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.
中文: SPORT提出了一种无需人工标注的迭代探索方法,通过自主优化使多模态智能体能够自主发现有效的工具使用策略,并在基准测试中展现出显著性能提升。
English: SPORT introduces an iterative exploration method that enables multimodal agents to autonomously develop effective tool usage strategies through self-optimization, eliminating the need for costly human annotations while demonstrating improved performance on benchmarks.
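
The step sampling and step verification loop described in the abstract can be sketched as follows. The sketch assumes that the verifier returns a numeric score per candidate step and that the highest- and lowest-scored candidates form the chosen/rejected pair. Both the pairing rule and all names (`build_step_preference`, the toy scores, the tool names) are hypothetical illustrations rather than the paper's implementation.

```python
from typing import Callable, Dict, List, Optional

def build_step_preference(
    context: str,
    candidates: List[str],
    verifier: Callable[[str, str], float],
) -> Optional[Dict[str, str]]:
    """Turn verifier feedback on sampled steps into one preference pair.

    Returns None when the verifier cannot separate the candidates.
    """
    scored = [(verifier(context, step), step) for step in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None  # no preference signal in this step
    return {"prompt": context, "chosen": best[1], "rejected": worst[1]}

# Toy verifier scores standing in for AI feedback (made-up values).
toy_scores = {
    "call ocr_tool on image": 0.9,
    "call search_tool": 0.6,
    "guess the answer": 0.1,
}
pair = build_step_preference(
    "Read the text in the image.",
    list(toy_scores),
    lambda ctx, step: toy_scores[step],
)
print(pair)
```

Pairs collected this way across many synthesized tasks would form the step-wise preference dataset used in the preference tuning stage.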

Authors:Mingkai Xu, Yongpeng Wu, Yuxuan Shi, Xiang-Gen Xia, Merouane Debbah, Wenjun Zhang, Ping Zhang
Title: Semantic-aided Parallel Image Transmission Compatible with Practical System
Abstract:
In this paper, we propose a novel semantic-aided image communication framework that is compatible with practical separation-based coding architectures. In particular, deep learning (DL)-based joint source-channel coding (JSCC) is integrated into classical separate source-channel coding (SSCC) to transmit images via a combination of a semantic stream and an image stream, produced by the DL networks and SSCC respectively, which we call parallel-stream transmission. The positive coding gain stems from the sophisticated design of the JSCC encoder, which leverages the residual information neglected by the SSCC to enhance the learnable image features. Furthermore, a conditional rate adaptation mechanism is introduced to adjust the transmission rate of the semantic stream according to the residual, rendering the framework more flexible and efficient in bandwidth allocation. We also design a dynamic stream aggregation strategy at the receiver, which makes the composite framework more robust to signal-to-noise ratio (SNR) fluctuations in wireless systems than a single conventional codec. Finally, the proposed framework is verified to surpass both traditional and DL-based competitors in a wide range of scenarios while remaining lightweight in terms of the transmission and computational complexity of the semantic stream, which shows its potential for application in real systems.
中文: 本文提出了一种语义辅助的图像通信框架,通过将深度学习联合信源信道编码与传统分离编码相结合,利用并行流传输提升图像传输效率,并在不同信道条件下保持更强的鲁棒性。
English: This paper introduces a semantic-aided image communication framework that integrates deep learning-based joint source-channel coding with traditional separate coding, using parallel streams to enhance transmission efficiency and robustness across varying channel conditions.

Authors:Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, Mingxuan Yuan
Title: Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search
Abstract:
Using Large Language Models (LLMs) in an evolutionary or other iterative search framework has demonstrated significant potential in automated algorithm design. However, the underlying fitness landscape, which is critical for understanding its search behavior, remains underexplored. In this paper, we illustrate and analyze the fitness landscape of LLM-assisted Algorithm Search (LAS) using a graph-based approach, where nodes represent algorithms and edges denote transitions between them. We conduct extensive evaluations across six algorithm design tasks and six commonly used LLMs. Our findings reveal that LAS landscapes are highly multimodal and rugged, particularly in combinatorial optimization tasks, with distinct structural variations across tasks and LLMs. Moreover, we adopt four different methods for algorithm similarity measurement and study their correlations to algorithm performance and operator behaviour. These insights not only deepen our understanding of LAS landscapes but also provide practical insights for designing more effective LAS methods.
中文: 本研究通过基于图的方法分析了大语言模型辅助算法搜索的适应度地形,揭示了其高度多模态且崎岖的结构特征在不同任务和模型间的显著差异,同时探讨了算法相似性与性能的关联,为优化搜索方法提供了重要见解。
English: This study analyzes the fitness landscape of LLM-assisted algorithm search using a graph-based approach, revealing highly multimodal and rugged structures that vary across tasks and models, while also examining algorithm similarity correlations to enhance future method design.
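
The graph view of the landscape (nodes are algorithms, edges are transitions) can be made concrete with a small sketch that counts local optima, one common indicator of multimodality. It assumes undirected edges and a higher-is-better fitness value. The node names, fitness values, and the `local_optima` helper are hypothetical and not drawn from the paper.

```python
from typing import Dict, Iterable, List, Tuple

def local_optima(
    fitness: Dict[str, float],
    edges: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return nodes whose fitness is not exceeded by any neighbor.

    Edges are treated as undirected; isolated nodes count as local optima.
    """
    neighbors = {node: set() for node in fitness}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return sorted(
        node
        for node in fitness
        if all(fitness[node] >= fitness[other] for other in neighbors[node])
    )

# Toy landscape: five candidate algorithms connected in a chain.
toy_fitness = {"A": 0.2, "B": 0.8, "C": 0.5, "D": 0.9, "E": 0.3}
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(local_optima(toy_fitness, toy_edges))
```

In this toy chain, `B` and `D` are both local optima, so a search that only accepts improving transitions could stall at `B` without reaching the better `D`. That is the practical meaning of a multimodal, rugged landscape.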

Authors:Hongyu Wang, Shuming Ma, Furu Wei
Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
Abstract:
Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
中文: BitNet v2 提出了一种新颖框架,通过 H-BitLinear 模块应用在线哈达玛变换将激活异常值平滑为类高斯分布,实现了 1 位大语言模型的原生 4 位激活量化,在保持性能的同时显著降低了内存占用和计算成本。
English: BitNet v2 introduces a novel framework with H-BitLinear, applying an online Hadamard transformation to smooth activation outliers into Gaussian-like distributions, enabling native 4-bit quantization for 1-bit LLMs while maintaining performance and reducing memory and computational costs.
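
A minimal sketch of why a Hadamard transformation helps before low-bit quantization: the transform spreads a single large outlier across all coordinates, so the quantization range is no longer dominated by one value. The code uses a normalized fast Walsh-Hadamard transform followed by symmetric absmax quantization to 4-bit integers. The absmax quantizer and the function names are assumptions chosen for illustration, since the abstract does not specify the quantizer used in H-BitLinear.

```python
import math

def hadamard_transform(x):
    """Normalized fast Walsh-Hadamard transform.

    len(x) must be a power of two. With the 1/sqrt(n) scaling the
    transform is orthogonal and is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize_int4_absmax(x):
    """Symmetric absmax quantization to integers in [-8, 7] (assumed scheme)."""
    scale = max(abs(v) for v in x) / 7.0 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in x]
    return q, scale

# Toy activation vector with one sharp outlier (made-up values).
activations = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
rotated = hadamard_transform(activations)
print(quantize_int4_absmax(activations)[0])  # small entries collapse to 0
print(quantize_int4_absmax(rotated)[0])      # values share the range more evenly
```

Because the normalized transform is orthogonal, it can be undone or folded into adjacent linear operations without changing the computation in exact arithmetic.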

Authors:Andreas Anastasiou, Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou
Title: Multiple Target Tracking Using a UAV Swarm in Maritime Environments
Abstract:
Nowadays, unmanned aerial vehicles (UAVs) are increasingly utilized in search and rescue missions, a trend driven by technological advancements, including enhancements in automation, avionics, and the reduced cost of electronics. In this work, we introduce a collaborative model predictive control (MPC) framework aimed at addressing the joint problem of guidance and state estimation for tracking multiple castaway targets with a fleet of autonomous UAV agents. We assume that each UAV agent is equipped with a camera sensor, which has a limited sensing range and is utilized for receiving noisy observations from multiple moving castaways adrift in maritime conditions. We derive a nonlinear mixed integer programming (NMIP)-based controller that facilitates the guidance of the UAVs by generating non-myopic trajectories within a receding planning horizon. These trajectories are designed to minimize the tracking error across multiple targets by directing the UAV fleet to locations expected to yield target measurements, thereby minimizing the uncertainty of the estimated target states. Extensive simulation experiments validate the effectiveness of our proposed method in tracking multiple castaways in maritime environments.
中文: 本研究提出了一种协作模型预测控制框架,通过生成非近视轨迹引导无人机编队跟踪海上遇险者,以优化传感器定位来最小化跟踪误差和状态估计不确定性。
English: This study presents a collaborative MPC framework for UAV fleets to track multiple maritime castaways by generating non-myopic trajectories that minimize tracking error and state uncertainty through optimized sensor positioning.

Authors:Haotian Zhang, Yuqi Li, Li Li, Dong Liu
Title: Learning Switchable Priors for Neural Image Compression
Abstract:
Neural image compression (NIC) usually adopts a predefined family of probabilistic distributions as the prior of the latent variables, and meanwhile relies on entropy models to estimate the parameters for the probabilistic family. More complex probabilistic distributions may fit the latent variables more accurately, but also incur higher complexity of the entropy models, limiting their practical value. To address this dilemma, we propose a solution to decouple the entropy model complexity from the prior distributions. We use a finite set of trainable priors that correspond to samples of the parametric probabilistic distributions. We train the entropy model to predict the index of the appropriate prior within the set, rather than the specific parameters. Switching between the trained priors further enables us to incorporate a skip mode into the prior set, which simply omits a latent variable during the entropy coding. To demonstrate the practical value of our solution, we present a lightweight NIC model, namely FastNIC, together with the learning of switchable priors. FastNIC obtains a better trade-off between compression efficiency and computational complexity for neural image compression. We also integrated the switchable priors into state-of-the-art NIC models and observed improved compression efficiency with a significant reduction of entropy coding complexity.
中文: 该方法通过使用有限的可训练先验集合并预测其索引,将熵模型复杂度与先验分布解耦,实现了跳过模式,从而在神经图像压缩中提升了压缩效率与计算复杂度之间的平衡。
English: The proposed method decouples entropy model complexity from prior distributions by using a finite set of trainable priors and predicting their indices, enabling a skip mode and improving the trade-off between compression efficiency and computational complexity in neural image compression.
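
The idea of predicting an index into a finite prior set, with one entry acting as a skip mode, can be sketched as a table lookup. The prior set below, the symbol alphabet, and the convention that a skipped latent is reconstructed as zero are all hypothetical choices for illustration. The sketch only computes ideal code lengths and does not implement an entropy coder or the trained index predictor.

```python
import math

SKIP_VALUE = 0  # assumed reconstruction for a skipped latent

# Hypothetical prior set: index 0 is the skip mode, the others are
# probability tables over integer latent symbols.
PRIORS = [
    None,
    {-1: 0.05, 0: 0.90, 1: 0.05},
    {-2: 0.10, -1: 0.20, 0: 0.40, 1: 0.20, 2: 0.10},
]

def code_length_bits(symbol, prior_index):
    """Ideal code length of one latent under the prior selected by index."""
    prior = PRIORS[prior_index]
    if prior is None:
        # Skip mode: nothing is written, so the latent must equal SKIP_VALUE.
        if symbol != SKIP_VALUE:
            raise ValueError("skip mode cannot represent a non-skip value")
        return 0.0
    return -math.log2(prior[symbol])

def total_bits(symbols, prior_indices):
    """Sum of ideal code lengths for a sequence of latents."""
    return sum(code_length_bits(s, i) for s, i in zip(symbols, prior_indices))

latents = [0, 0, 1, -2]
indices = [0, 1, 1, 2]  # would come from the entropy model's index prediction
print(total_bits(latents, indices))
```

Since every prior is a fixed table, the entropy coder only needs the predicted index per latent rather than freshly computed distribution parameters, which is the source of the complexity reduction the abstract describes.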

Authors:Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Title: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Abstract:
As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose the guiding principles accordingly for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.
中文: 尽管在多语言基准上投入了大量资金,英语仍占主导地位,且本地化基准比翻译版本更能契合人类判断,这凸显了制定文化适配评估体系的必要性。
English: Despite substantial investments in multilingual benchmarks, English remains overrepresented, and localized benchmarks outperform translations in aligning with human judgments, highlighting the need for culturally tailored evaluations.

Authors:Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, Yong Qin
Title: Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
Abstract:
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
中文: 该研究发布了Chinese-LiPS多模态数据集,包含100小时的唇读和幻灯片视觉信息,并提出了LiPS-AVSR方法,通过融合这两种视觉线索将语音识别性能提升了约35%。
English: The study introduces Chinese-LiPS, a 100-hour multimodal dataset combining lip-reading and presentation slides, and proposes LiPS-AVSR, a pipeline that enhances ASR performance by about 35% by integrating both visual cues.

Authors:Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong
Title: ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Abstract:
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
中文: ReTool通过实时代码执行与自动化强化学习的工具集成学习,显著提升了模型在复杂数学推理中的性能,其准确率和效率均优于纯文本基线及OpenAI o1-preview等先进模型。
English: ReTool enhances reasoning models by integrating real-time code execution and automated reinforcement learning, achieving superior accuracy and efficiency in complex mathematical problem-solving compared to text-based baselines and advanced models like OpenAI's o1-preview.

Authors:Nicolas Baumann, Cheng Hu, Paviththiren Sivasothilingam, Haotong Qin, Lei Xie, Michele Magno, Luca Benini
Title: Enhancing Autonomous Driving Systems with On-Board Deployed Large Language Models
Abstract:
Neural Networks (NNs) trained through supervised learning struggle with managing edge-case scenarios common in real-world driving due to the intractability of exhaustive datasets covering all edge-cases, making knowledge-driven approaches, akin to how humans intuitively detect unexpected driving behavior, a suitable complement to data-driven methods. This work proposes a hybrid architecture combining a low-level Model Predictive Controller (MPC) with locally deployed Large Language Models (LLMs) to enhance decision-making and Human Machine Interaction (HMI). The DecisionxLLM module evaluates robotic state information against natural language instructions to ensure adherence to desired driving behavior. The MPCxLLM module then adjusts MPC parameters based on LLM-generated insights, achieving control adaptability while preserving the safety and constraint guarantees of traditional MPC systems. Further, to enable efficient on-board deployment and to eliminate dependency on cloud connectivity, we shift processing to the on-board computing platform: we propose an approach that exploits Retrieval Augmented Generation (RAG), Low Rank Adaptation (LoRA) fine-tuning, and quantization. Experimental results demonstrate that these enhancements yield significant improvements in reasoning accuracy by up to 10.45%, control adaptability by as much as 52.2%, and up to 10.5x increase in computational efficiency (tokens/s), validating the proposed framework's practicality for real-time deployment even on down-scaled robotic platforms. This work bridges high-level decision-making with low-level control adaptability, offering a synergistic framework for knowledge-driven and adaptive Autonomous Driving Systems (ADS).
中文: 本研究提出了一种结合大语言模型与模型预测控制的混合架构,通过增强决策能力和控制适应性,显著提升了自动驾驶系统的推理准确性、控制灵活性和计算效率,适用于实时部署场景。
English: This study introduces a hybrid framework integrating Large Language Models with Model Predictive Control to enhance autonomous driving decision-making and adaptability, achieving significant improvements in reasoning accuracy, control flexibility, and computational efficiency for real-time deployment.

Authors:Jiani Liu, Zhiyuan Wang, Zeliang Zhang, Chao Huang, Susan Liang, Yunlong Tang, Chenliang Xu
Title: Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability
Abstract:
Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.
中文: 视觉变换器因其计算冗余展现出高对抗迁移性,所提出的利用该冗余的技术显著提升了跨模型架构的攻击效果。
English: Vision Transformers exhibit high adversarial transferability due to computational redundancy, and the proposed techniques leveraging this redundancy significantly enhance attack effectiveness across model architectures.

Authors:Wei Tao, Xiaoyang Qu, Kai Lu, Jiguang Wan, Guokuan Li, Jianzong Wang
Title: MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs
Abstract:
When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.
中文:本文提出MADLLM方法,通过三重编码技术将多元时间序列与大型语言模型的文本模态对齐,在异常检测任务中超越了现有最优方法。
English: This paper introduces MADLLM, a novel method that uses triple encoding to align multivariate time series with LLMs' text modality, outperforming existing approaches on anomaly detection tasks.

Authors:Xin Tan, Yuzhou Ji, He Zhu, Yuan Xie
Title: FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents
Abstract:
The semantically interactive radiance field has long been a promising backbone for 3D real-world applications, such as embodied AI to achieve scene understanding and manipulation. However, multi-granularity interaction remains a challenging task due to the ambiguity of language and degraded quality when it comes to queries upon object components. In this work, we present FMLGS, an approach that supports part-level open-vocabulary query within 3D Gaussian Splatting (3DGS). We propose an efficient pipeline for building and querying consistent object- and part-level semantics based on Segment Anything Model 2 (SAM2). We design a semantic deviation strategy to solve the problem of language ambiguity among object parts, which interpolates the semantic features of fine-grained targets for enriched information. Once trained, we can query both objects and their describable parts using natural language. Comparisons with other state-of-the-art methods show that our method can not only better locate specified part-level targets, but also achieve first-place performance in both speed and accuracy, where FMLGS is 98x faster than LERF, 4x faster than LangSplat, and 2.5x faster than LEGaussians. Meanwhile, we further integrate FMLGS as a virtual agent that can interactively navigate through 3D scenes, locate targets, and respond to user demands through a chat interface, which demonstrates the potential of our work to be further expanded and applied in the future.
中文摘要:FMLGS提出了一种基于3D高斯泼溅的部件级开放词汇查询方法,通过语义特征插值解决语言歧义问题,在精确定位物体部件的同时实现了98倍的速度提升,并能通过聊天界面进行交互式三维场景导航。
English Summary: FMLGS introduces a part-level open-vocabulary query method for 3D Gaussian Splatting that resolves language ambiguity through semantic feature interpolation, achieving superior speed and accuracy in locating object components while enabling interactive 3D scene navigation.

Authors:Tobias Pfandzelter, Nikita Bauer, Alexander Leis, Corentin Perdrizet, Felix Trautwein, Trever Schirmer, Osama Abboud, David Bermbach
Title: Trabant: A Serverless Architecture for Multi-Tenant Orbital Edge Computing
Abstract:
Orbital edge computing reduces the data transmission needs of Earth observation satellites by processing sensor data on-board, allowing near-real-time insights while minimizing downlink costs. However, current orbital edge computing architectures are inflexible, requiring custom mission planning and high upfront development costs. In this paper, we propose a novel approach: shared Earth observation satellites that are operated by a central provider but used by multiple tenants. Each tenant can execute their own logic on-board the satellite to filter, prioritize, and analyze sensor data. We introduce Trabant, a serverless architecture for shared satellite platforms, leveraging the Function-as-a-Service (FaaS) paradigm and time-shifted computing. This architecture abstracts operational complexities, enabling dynamic scheduling under satellite resource constraints, reducing deployment overhead, and aligning event-driven satellite observations with intermittent computation. We present the design of Trabant, demonstrate its capabilities with a proof-of-concept prototype, and evaluate it using real satellite computing telemetry data. Our findings suggest that Trabant can significantly reduce mission planning overheads, offering a scalable and efficient platform for diverse Earth observation missions.
中文: 轨道边缘计算通过在卫星上处理数据以减少传输需求,而Trabant架构提出了一种基于无服务器计算的共享卫星平台,允许多租户动态执行定制逻辑,显著降低了任务规划开销并提升了可扩展性。
English: Orbital edge computing enables real-time data processing on satellites to reduce transmission costs, and the proposed Trabant architecture introduces a serverless, shared platform that allows multiple users to run custom functions efficiently, lowering deployment overhead and improving scalability.

Authors:Bingyan Xie, Yongpeng Wu, Feng Shu, Jiangzhou Wang, Wenjun Zhang
Title: Multi-user Wireless Image Semantic Transmission over MIMO Multiple Access Channels
Abstract:
This paper focuses on a typical uplink transmission scenario over the multiple-input multiple-output multiple access channel (MIMO-MAC) and proposes a multi-user learnable CSI fusion semantic communication (MU-LCFSC) framework. It incorporates CSI as side information into both the semantic encoders and decoders to generate a proper feature mask map, which in turn produces a more robust attention weight distribution. At the decoding end in particular, a cooperative successive interference cancellation procedure is conducted along with a cooperative mask ratio generator, which flexibly controls the mask elements of the feature mask maps. Numerical results show that the proposed MU-LCFSC outperforms DeepJSCC-NOMA by over 3 dB in terms of PSNR.
中文: 本文提出了一种多用户可学习CSI融合语义通信框架,将信道状态信息融入编解码器以通过特征掩码映射和干扰消除提升鲁棒性,在PSNR指标上优于DeepJSCC-NOMA超过3分贝。
English: This paper introduces a multi-user learnable CSI fusion semantic communication framework that integrates channel state information into encoders and decoders to enhance robustness through feature mask mapping and interference cancellation, outperforming DeepJSCC-NOMA by over 3 dB in PSNR.

Authors:Junyuan Gao, Shuao Chen, Yongpeng Wu, Liang Liu, Giuseppe Caire, H. Vincent Poor, Wenjun Zhang
Title: Finite-Blocklength Information Theory
Abstract:
Traditional asymptotic information-theoretic studies of the fundamental limits of wireless communication systems primarily rely on some ideal assumptions, such as infinite blocklength and vanishing error probability. While these assumptions enable tractable mathematical characterizations, they fail to capture the stringent requirements of some emerging next-generation wireless applications, such as ultra-reliable low latency communication and ultra-massive machine type communication, in which it is required to support a much wider range of features including short-packet communication, extremely low latency, and/or low energy consumption. To better support such applications, it is important to consider finite-blocklength information theory. In this paper, we present a comprehensive review of the advances in this field, followed by a discussion on the open questions. Specifically, we commence with the fundamental limits of source coding in the non-asymptotic regime, with a particular focus on lossless and lossy compression in point-to-point (P2P) and multiterminal cases. Next, we discuss the fundamental limits of channel coding in P2P channels, multiple access channels, and emerging massive access channels. We further introduce recent advances in joint source and channel coding, highlighting its considerable performance advantage over separate source and channel coding in the non-asymptotic regime. In each part, we review various non-asymptotic achievability bounds, converse bounds, and approximations, as well as key ideas behind them, which are essential for providing engineering insights into the design of future wireless communication systems.
中文: 本文综述了有限码长信息理论的研究进展,针对传统渐近方法的局限性,深入探讨了信源与信道编码在非渐近域中的基本极限,以支持新一代无线通信应用的严苛需求。
English: This paper reviews finite-blocklength information theory, addressing the limitations of traditional asymptotic approaches and exploring fundamental limits in source and channel coding to meet the stringent requirements of next-generation wireless applications.

Authors:Aly M. Kassem, Bernhard Schölkopf, Zhijing Jin
Title: How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Abstract:
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited. They largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types, including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking. Additionally, it integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions. For instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM even when simpler models would suffice, while routing jailbreaking attempts to weaker models, thereby elevating safety risks.
中文摘要:大语言模型路由通过动态分配查询来平衡计算成本与性能,但现有评估基准存在局限,为此提出DSC基准框架,全面评估各类查询任务并揭示隐私安全等潜在风险。
English Summary: Large language model routing optimizes cost and performance by directing queries to suitable models, but current benchmarks lack comprehensive evaluation of task-specific behaviors and security risks, prompting the introduction of the DSC benchmark to assess diverse query categories and uncover hidden vulnerabilities.

Authors:Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Zhen-Hua Ling
Title: A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication
Abstract:
This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, aiming for low-latency inference and real-time efficient generation. To improve codebook utilization efficiency and compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ sequentially connects scalar quantizers and improved vector quantizers in a residual manner, constructing coarse audio contours and refining acoustic details, respectively. Experimental results confirm that the proposed StreamCodec achieves decoded audio quality comparable to advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms and achieves a generation speed nearly 20 times real-time on a CPU, with a lightweight model size of just 7M parameters, making it highly suitable for real-time communication applications.
中文: StreamCodec是一种采用因果编码器-解码器结构和残差标量矢量量化的低延迟神经音频编解码器,在极低码率下实现高音质,CPU运算速度接近实时通信需求。
English: StreamCodec is a low-latency neural audio codec featuring a causal encoder-decoder and residual scalar-vector quantization, achieving high audio quality at a low bitrate (ViSQOL 4.30 at 1.5 kbps) while generating audio nearly 20 times faster than real time on a CPU.

Authors:Duanyang Yuan, Sihang Zhou, Xiaoshu Chen, Dong Wang, Ke Liang, Xinwang Liu, Jian Huang
Title: Knowledge Graph Completion with Relation-Aware Anchor Enhancement
Abstract:
Text-based knowledge graph completion methods take advantage of pre-trained language models (PLMs) to enhance intrinsic semantic connections of raw triplets with detailed text descriptions. Typical methods in this branch map an input query (textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, respectively, and then maximize the probability of valid triples. These methods are achieving promising performance and attracting increasing attention with the rapid development of large language models. According to the property of the language models, the more related and specific context information the input query provides, the more discriminative the resultant embedding will be. In this paper, through observation and validation, we find a neglected fact that the relation-aware neighbors of the head entities in queries could act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, in our method, to provide a reference of what the target entity might be like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching. The results of our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that by integrating our relation-aware anchor enhancement strategy, the performance of current leading methods can be notably enhanced without substantial modifications.
中文: 本文提出RAA-KGC方法,通过将头实体的关系感知邻居作为上下文锚点来优化查询嵌入表示,无需大幅改动结构即可显著提升知识图谱链接预测性能。
English: This paper introduces RAA-KGC, a knowledge graph completion method that leverages relation-aware neighbors of head entities as contextual anchors to refine query embeddings, significantly improving link prediction accuracy without major structural changes.

Authors:Savvas Papaioannou, Panayiotis Kolios, Theocharis Theocharides, Christos G. Panayiotou, Marios M. Polycarpou
Title: Jointly-optimized Trajectory Generation and Camera Control for 3D Coverage Planning
Abstract:
This work proposes a jointly optimized trajectory generation and camera control approach, enabling an autonomous agent, such as an unmanned aerial vehicle (UAV) operating in 3D environments, to plan and execute coverage trajectories that maximally cover the surface area of a 3D object of interest. Specifically, the UAV's kinematic and camera control inputs are jointly optimized over a rolling planning horizon to achieve complete 3D coverage of the object. The proposed controller incorporates ray-tracing into the planning process to simulate the propagation of light rays, thereby determining the visible parts of the object through the UAV's camera. This integration enables the generation of precise look-ahead coverage trajectories. The coverage planning problem is formulated as a rolling finite-horizon optimal control problem and solved using mixed-integer programming techniques. Extensive real-world and synthetic experiments validate the performance of the proposed approach.
中文: 本研究提出一种轨迹生成与相机控制的联合优化方法,通过集成光线追踪的滚动规划与混合整数规划,使无人机能够高效实现目标物体的完整三维覆盖。
English: This study introduces a joint optimization method for trajectory generation and camera control that enables UAVs to efficiently achieve complete 3D object coverage through rolling-horizon planning with integrated ray-tracing and mixed-integer programming.

Authors:Savvas Papaioannou, Panayiotis Kolios, Theocharis Theocharides, Christos G. Panayiotou, Marios M. Polycarpou
Title: Rolling Horizon Coverage Control with Collaborative Autonomous Agents
Abstract:
This work proposes a coverage controller that enables an aerial team of distributed autonomous agents to collaboratively generate non-myopic coverage plans over a rolling finite horizon, aiming to cover specific points on the surface area of a 3D object of interest. The collaborative coverage problem, formulated as a distributed model predictive control problem, optimizes the agents' motion and camera control inputs while considering inter-agent constraints aimed at reducing work redundancy. The proposed coverage controller integrates constraints based on light-path propagation techniques to predict the parts of the object's surface that are visible with regard to the agents' future anticipated states. This work also demonstrates how complex, non-linear visibility assessment constraints can be converted into logical expressions that are embedded as binary constraints into a mixed-integer optimization framework. The proposed approach has been demonstrated through simulations and practical applications for inspecting buildings with unmanned aerial vehicles (UAVs).
中文摘要:本研究开发了一种分布式覆盖控制器,使空中智能体团队能够协作生成针对三维物体的非近视覆盖规划,通过将可见性约束和二元逻辑表达式整合到混合整数优化框架中,并在无人机建筑检测的仿真与实际应用中验证了其有效性。
English Summary: This study develops a distributed coverage controller for aerial teams to collaboratively generate non-myopic coverage plans for 3D objects, integrating visibility constraints and binary logical expressions into a mixed-integer optimization framework, validated through UAV building inspection simulations and applications.

Authors:Rui Mao, Yongpeng Wu, Boxiao Shen, Symeon Chatzinotas, Björn Ottersten, Wenjun Zhang
Title: Grant-Free Random Access in Uplink LEO Satellite Communications with OFDM
Abstract:
This paper investigates joint device activity detection and channel estimation for grant-free random access in low Earth orbit (LEO) satellite communications. We consider uplink communications from multiple single-antenna terrestrial users to a LEO satellite equipped with a uniform planar array of multiple antennas, where orthogonal frequency division multiplexing (OFDM) modulation is adopted. To combat the severe Doppler shift, a transmission scheme is proposed, where the discrete prolate spheroidal basis expansion model (DPS-BEM) is introduced to reduce the number of unknown channel parameters. Then the vector approximate message passing (VAMP) algorithm is employed to approximate the minimum mean square error estimation of the channel, and a Markov random field is incorporated to capture the channel sparsity. Meanwhile, the expectation-maximization (EM) approach is integrated to learn the hyperparameters in the priors. Finally, active devices are detected by calculating the energy of the estimated channel. Simulation results demonstrate that the proposed method outperforms conventional algorithms in terms of activity error rate and channel estimation precision.
中文摘要:本文提出了一种基于向量近似消息传递算法和离散扁球面基扩展模型的低轨卫星通信联合设备活动检测与信道估计方法,有效克服多普勒频移,其性能优于传统算法。
English Summary: This paper proposes a method for joint device activity detection and channel estimation in LEO satellite communications using VAMP algorithm with DPS-BEM to mitigate Doppler effects, showing superior performance over conventional approaches.

Authors:Zhiyu He, Zhixin Ling, Jiayu Li, Zhiqiang Guo, Weizhi Ma, Xinchen Luo, Min Zhang, Guorui Zhou
Title: Short Video Segment-level User Dynamic Interests Modeling in Personalized Recommendation
Abstract:
The rapid growth of short videos has necessitated effective recommender systems to match users with content tailored to their evolving preferences. Current video recommendation models primarily treat each video as a whole, overlooking the dynamic nature of user preferences with specific video segments. In contrast, our research focuses on segment-level user interest modeling, which is crucial for understanding how users' preferences evolve during video browsing. To capture users' dynamic segment interests, we propose an innovative model that integrates a hybrid representation module, a multi-modal user-video encoder, and a segment interest decoder. Our model addresses the challenges of capturing dynamic interest patterns, missing segment-level labels, and fusing different modalities, achieving precise segment-level interest prediction. We present two downstream tasks to evaluate the effectiveness of our segment interest modeling approach: video-skip prediction and short video recommendation. Our experiments on real-world short video datasets with diverse modalities show promising results on both tasks. It demonstrates that segment-level interest modeling brings a deep understanding of user engagement and enhances video recommendations. We also release a unique dataset that includes segment-level video data and diverse user behaviors, enabling further research in segment-level interest modeling. This work pioneers a novel perspective on understanding user segment-level preference, offering the potential for more personalized and engaging short video experiences.
中文摘要:本研究提出一种创新的短视频分段兴趣建模方法,通过混合表征模块和多模态编码器捕捉用户对视频片段的动态偏好,显著提升了视频跳过预测和内容推荐的精准度。
English Summary: This study introduces a novel model for segment-level user interest modeling in short videos, utilizing a hybrid representation module and multi-modal encoder to enhance recommendation accuracy by capturing dynamic preferences within video segments.

Authors:Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, Di Wang
Title: Understanding Aha Moments: from External Observations to Internal Mechanisms
Abstract:
Large Reasoning Models (LRMs), capable of reasoning through complex problems, have become crucial for tasks like programming, mathematics, and commonsense reasoning. However, a key challenge lies in understanding how these models acquire reasoning capabilities and exhibit "aha moments" when they reorganize their methods to allocate more thinking time to problems. In this work, we systematically study "aha moments" in LRMs, from linguistic patterns, descriptions of uncertainty, and "Reasoning Collapse" to analysis in latent space. We demonstrate that the "aha moment" is externally manifested in a more frequent use of anthropomorphic tones for self-reflection and an adaptive adjustment of uncertainty based on problem difficulty. This process helps the model complete reasoning without succumbing to "Reasoning Collapse". Internally, it corresponds to a separation between anthropomorphic characteristics and pure reasoning, with an increased anthropomorphic tone for more difficult problems. Furthermore, we find that the "aha moment" helps models solve complex problems by altering their perception of problem difficulty. As layer depth increases, simpler problems tend to be perceived as more complex, while more difficult problems appear simpler.
中文摘要:研究表明,大型推理模型通过增强拟人化自我反思和自适应不确定性调整来展现“顿悟时刻”,这能避免推理崩溃,并在不同模型层中改变其对问题难度的感知。
English Summary: The study reveals that Large Reasoning Models exhibit "aha moments" through increased anthropomorphic self-reflection and adaptive uncertainty adjustment, which prevents Reasoning Collapse and alters their perception of problem difficulty across model layers.

Authors:Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang
Title: Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Abstract:
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
Chinese: 研究发现预训练数据中的语码转换是大型语言模型具备多语言能力的关键,而引入合成的语码转换数据能显著提升不同语言间的对齐效果。
English: The study finds that code-switching in pre-training data is crucial for the multilingual capabilities of large language models, and incorporating synthetic code-switching significantly enhances language alignment across diverse languages.

Authors:Hang Li, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
Title: LLM-VPRF: Large Language Model Based Vector Pseudo Relevance Feedback
Abstract:
Vector Pseudo Relevance Feedback (VPRF) has shown promising results in improving BERT-based dense retrieval systems through iterative refinement of query representations. This paper investigates the generalizability of VPRF to Large Language Model (LLM) based dense retrievers. We introduce LLM-VPRF and evaluate its effectiveness across multiple benchmark datasets, analyzing how different LLMs impact the feedback mechanism. Our results demonstrate that VPRF's benefits successfully extend to LLM architectures, establishing it as a robust technique for enhancing dense retrieval performance regardless of the underlying models. This work bridges the gap between VPRF with traditional BERT-based dense retrievers and modern LLMs, while providing insights into their future directions.
中文:VPRF成功扩展到基于大语言模型的密集检索系统,LLM-VPRF在多个基准测试中验证了其有效性,弥合了传统BERT模型与现代大语言模型之间的技术鸿沟。
English: VPRF effectively enhances dense retrieval performance in LLM-based systems, as demonstrated by the successful application of LLM-VPRF across various benchmarks, bridging the gap between traditional BERT models and modern LLMs.

Authors:Mingqian Feng, Zeliang Zhang, Jinyang Jiang, Yijie Peng, Chenliang Xu
Title: Forward Learning with Differential Privacy
Abstract:
Differential privacy (DP) in deep learning is a critical concern as it ensures the confidentiality of training data while maintaining model utility. Existing DP training algorithms provide privacy guarantees by clipping and then injecting external noise into sample gradients computed by the backpropagation algorithm. Different from backpropagation, forward-learning algorithms based on perturbation inherently add noise during the forward pass and utilize randomness to estimate the gradients. Although these algorithms are non-privatized, the introduction of noise during the forward pass indirectly provides internal randomness protection to the model parameters and their gradients, suggesting the potential for naturally providing differential privacy. In this paper, we propose a privatized forward-learning algorithm, Differential Private Unified Likelihood Ratio (DP-ULR), and demonstrate its differential privacy guarantees. DP-ULR features a novel batch sampling operation with rejection, of which we provide theoretical analysis in conjunction with classic differential privacy mechanisms. DP-ULR is also underpinned by a theoretically guided privacy controller that dynamically adjusts noise levels to manage privacy costs in each training step. Our experiments indicate that DP-ULR achieves competitive performance compared to traditional differential privacy training algorithms based on backpropagation, maintaining nearly the same privacy loss limits.
中文: 本文提出DP-ULR算法,通过带拒绝的批次采样和动态噪声调节实现差分隐私保护,在保持与传统反向传播算法相近隐私损耗的同时,获得了具有竞争力的模型性能。
English: The paper introduces DP-ULR, a privatized forward-learning algorithm that uses batch sampling with rejection and dynamic noise adjustment to provide differential privacy while maintaining competitive model performance compared to traditional backpropagation-based methods.

Authors:Pau Colomer, Christian Deppe, Holger Boche, Andreas Winter
Title: Quantum Hypothesis Testing Lemma for Deterministic Identification over Quantum Channels
Abstract:
In our previous work, we presented the \emph{Hypothesis Testing Lemma}, a key tool that establishes sufficient conditions for the existence of good deterministic identification (DI) codes for memoryless channels with finite output, but arbitrary input alphabets. In this work, we provide a full quantum analogue of this lemma, which shows that the existence of a DI code in the quantum setting follows from a suitable packing in a modified space of output quantum states. Specifically, we demonstrate that such a code can be constructed using product states derived from this packing. This result enables us to tighten the capacity lower bound for DI over quantum channels beyond the simultaneous decoding approach. In particular, we can now express these bounds solely in terms of the Minkowski dimension of a certain state space, giving us new insights to better understand the nature of the protocol, and the separation between simultaneous and non-simultaneous codes. We extend the discussion with a particular channel example for which we can construct an optimum code.
中文: 本研究将假设检验引理推广至量子领域,证明通过修正输出空间的积态可构建确定性识别码,从而收紧容量下界并通过闵可夫斯基维度分析揭示协议本质。
English: This work extends the Hypothesis Testing Lemma to the quantum domain, establishing that deterministic identification codes can be constructed using product states from a modified output space, thereby tightening capacity bounds and revealing insights through Minkowski dimension analysis.

Authors:Yaning Zhao, Pau Colomer, Holger Boche, Christian Deppe
Title: Identification over Poisson ISI Channels: Feedback and Molecular Applications
Abstract:
Molecular communication (MC) enables information transfer via molecules, making it ideal for biomedical applications where traditional methods fall short. In many such scenarios, identifying specific events is more critical than decoding full messages, motivating the use of deterministic identification (DI). This paper investigates DI over discrete-time Poisson channels (DTPCs) with inter-symbol interference (ISI), a realistic setting due to channel memory effects. We improve the known upper bound on DI capacity under power constraints from $\frac{3}{2} + \kappa$ to $\frac{1 + \kappa}{2}$. Additionally, we present the first results on deterministic identification with feedback (DIF) in this context, providing a constructive lower bound. These findings enhance the theoretical understanding of MC and support more efficient, feedback-driven biomedical systems.
Chinese: 本文通过改进具有符号间干扰的离散时间泊松信道的确定性识别容量界限,并首次引入基于反馈的识别结果,推动了分子通信的发展,为更高效的生物医学应用提供了支持。
English: This paper advances molecular communication by improving the deterministic identification capacity bound for discrete-time Poisson channels with interference and introducing the first feedback-based identification results, supporting more efficient biomedical applications.

Authors:Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Xupeng Miao, Bin Cui
Title: Hetu v2: A General and Scalable Deep Learning System with Hierarchical and Heterogeneous Single Program Multiple Data Annotations
Abstract:
The Single Program Multiple Data (SPMD) paradigm provides a unified abstraction to annotate various parallel dimensions in distributed deep learning (DL) training. With SPMD, users can write training programs from the viewpoint of a single device, and the system will automatically deduce the tensor sharding and communication patterns. However, with the recent development in large-scale DL models, distributed training exhibits spatial and temporal workload heterogeneity, arising from both device disparities (e.g., mixed hardware, failures) and data variations (e.g., uneven sequence lengths). Such heterogeneity violates SPMD's assumption of uniform workload partitioning, which restricts its ability to express and optimize heterogeneous parallel strategies effectively. To address this, we propose HSPMD within the Hetu v2 system to achieve general and scalable DL training. HSPMD extends SPMD's annotations to support asymmetric sharding and composes standard communication primitives for hierarchical communication, all while retaining the simplicity of a single-device declarative programming model. Leveraging HSPMD, Hetu handles spatial heterogeneity through progressive graph specialization, enabling device-specific execution logic, and addresses temporal heterogeneity via dynamic graph switching. Evaluations on heterogeneous clusters, elastic training, and mixed-length data scenarios show that HSPMD matches or outperforms specialized systems, providing a flexible and efficient solution for modern large-scale model training.
中文摘要:HSPMD扩展了SPMD范式,通过支持非对称分片和分层通信来处理分布式深度学习中的空间和时间工作负载异构性,在Hetu v2系统中实现了对多样化硬件和数据场景的高效训练。
English Summary: HSPMD extends the SPMD paradigm to handle spatial and temporal workload heterogeneity in distributed deep learning by supporting asymmetric sharding and hierarchical communication, enabling efficient training on diverse hardware and data scenarios within the Hetu v2 system.
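
As a rough illustration of what asymmetric sharding means in practice, the sketch below splits the rows of a tensor unevenly across devices in proportion to a per-device weight (for example, relative throughput). The function names and the proportional rule are assumptions made for this example; they are not the HSPMD annotation API or Hetu's actual partitioning logic.

```python
def asymmetric_shard_sizes(num_rows, device_weights):
    """Split num_rows across devices in proportion to device_weights.

    Hypothetical helper for illustration only; not part of Hetu v2.
    """
    if num_rows < 0 or not device_weights or any(w <= 0 for w in device_weights):
        raise ValueError("need num_rows >= 0 and strictly positive weights")
    total = sum(device_weights)
    # Ideal (possibly fractional) share of rows for each device.
    shares = [num_rows * w / total for w in device_weights]
    sizes = [int(s) for s in shares]
    # Hand out the rows lost to rounding, largest fractional part first.
    leftover = num_rows - sum(sizes)
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - sizes[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        sizes[i] += 1
    return sizes


def shard_ranges(sizes):
    """Turn shard sizes into half-open (start, end) row ranges."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


# A device weighted 3x as heavily as its peer receives 3x the rows.
sizes = asymmetric_shard_sizes(12, [3, 1])
print(sizes, shard_ranges(sizes))
```

With equal weights this reduces to the uniform partitioning that plain SPMD assumes; unequal weights give each device a different slice of the same logical tensor.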

Authors:Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Title: Prompt Injection Attack to Tool Selection in LLM Agents
Abstract:
Tool selection is a key component of LLM agents. A popular approach follows a two-step process, retrieval and selection, to pick the most appropriate tool from a tool library for a given task. In this work, we introduce ToolHijacker, a novel prompt injection attack targeting tool selection in no-box scenarios. ToolHijacker injects a malicious tool document into the tool library to manipulate the LLM agent's tool selection process, compelling it to consistently choose the attacker's malicious tool for an attacker-chosen target task. Specifically, we formulate the crafting of such tool documents as an optimization problem and propose a two-phase optimization strategy to solve it. Our extensive experimental evaluation shows that ToolHijacker is highly effective, significantly outperforming existing manual-based and automated prompt injection attacks when applied to tool selection. Moreover, we explore various defenses, including prevention-based defenses (StruQ and SecAlign) and detection-based defenses (known-answer detection, DataSentinel, perplexity detection, and perplexity windowed detection). Our experimental results indicate that these defenses are insufficient, highlighting the urgent need for developing new defense strategies.
中文摘要:本文提出ToolHijacker这一新型提示注入攻击,通过注入恶意工具文档来操控LLM代理的工具选择过程,实验证明该攻击方法高效且现有防御措施均不足以应对。
English summary: This paper introduces ToolHijacker, a novel prompt injection attack that manipulates LLM agents' tool selection by injecting malicious tool documents, and demonstrates its effectiveness while showing existing defenses remain inadequate.

Authors:Rong Cheng, Jinyi Liu, Yan Zheng, Fei Ni, Jiazhen Du, Hangyu Mao, Fuzheng Zhang, Bo Wang, Jianye Hao
Title: DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering
Abstract:
Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.
Chinese: DualRAG是一个双过程框架,通过推理增强查询和渐进知识聚合协同工作,在多跳问答任务中显著提高了答案的准确性和连贯性。
English: DualRAG is a dual-process framework that synergistically integrates reasoning and retrieval through Reasoning-augmented Querying and progressive Knowledge Aggregation, significantly enhancing answer accuracy and coherence in multi-hop question answering tasks.

Authors:Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu
Title: Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey
Abstract:
Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches, for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.
中文: 本文对检索增强生成(RAG)评估方法进行了最全面的综述,系统梳理了系统性能、事实准确性和安全性的评估方法,并弥合了传统评估与LLM驱动评估之间的实践差异。
English: This paper presents the most comprehensive survey of Retrieval-Augmented Generation (RAG) evaluation methods, systematically reviewing approaches for assessing system performance, factual accuracy, and safety while bridging traditional and LLM-driven evaluation practices.

Authors:Avaneesh Devkota, Rachmad Vidya Wicaksana Putra, Muhammad Shafique
Title: SwitchMT: An Adaptive Context Switching Methodology for Scalable Multi-Task Learning in Intelligent Autonomous Agents
Abstract:
The ability to train intelligent autonomous agents (such as mobile robots) on multiple tasks is crucial for adapting to dynamic real-world environments. However, state-of-the-art reinforcement learning (RL) methods only excel in single-task settings, and still struggle to generalize across multiple tasks due to task interference. Moreover, real-world environments also require the agents to have data stream processing capabilities. Toward this, a state-of-the-art work employs Spiking Neural Networks (SNNs) to improve multi-task learning by exploiting temporal information in data streams, while enabling low-power/energy event-based operations. However, it relies on fixed context/task-switching intervals during its training, hence limiting the scalability and effectiveness of multi-task learning. To address these limitations, we propose SwitchMT, a novel adaptive task-switching methodology for RL-based multi-task learning in autonomous agents. Specifically, SwitchMT employs the following key ideas: (1) a Deep Spiking Q-Network with active dendrites and dueling structure, that utilizes task-specific context signals to create specialized sub-networks; and (2) an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves superior performance in multi-task learning compared to state-of-the-art methods. It achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) compared to the state-of-the-art, showing its better generalized learning capability. These results highlight the effectiveness of our SwitchMT methodology in addressing task interference while enabling multi-task learning automation through adaptive task switching, thereby paving the way for more efficient generalist agents with scalable multi-task learning capabilities.
中文: SwitchMT提出了一种自适应任务切换方法,通过深度脉冲Q网络和动态策略解决多任务学习中的任务干扰问题,在Atari游戏中表现优异,提升了自主智能体的可扩展多任务学习能力。
English: SwitchMT introduces an adaptive task-switching method using a Deep Spiking Q-Network and dynamic policy to overcome multi-task learning limitations, achieving superior performance in Atari games and enhancing scalability for autonomous agents.

Authors:Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
Title: The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Abstract:
The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.
中文: RAPO是一种新颖的检索增强提示优化框架,通过双重优化分支改进用户提示,使其与训练提示分布对齐,从而提升生成视频的静态与动态质量。
English: RAPO is a novel retrieval-augmented prompt optimization framework that refines user prompts through dual optimization branches to enhance video generation quality by aligning them with training prompt distributions.

Authors:Yanbo Wang, Jiyang Guan, Jian Liang, Ran He
Title: Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Abstract:
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability, while the specific knowledge or patterns acquired from the high-quality dataset remain unclear. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction finetuning. Simply correcting the underlying data bias could narrow the safety gap in the vision domain.
Chinese: 多模态大语言模型存在安全对齐差距,主要源于数据分布偏差,通过使用包含简单拒绝回复的少量良性数据进行微调,无需大量恶意数据收集,即可显著缩小视觉领域的安全差距。
English: Multi-modal large language models face a safety alignment gap due to data distribution biases, which can be effectively narrowed by fine-tuning with a small set of benign data containing simple rejection responses, without requiring extensive malicious data collection.

Authors:Michael Kölle, Alexander Feist, Jonas Stein, Sebastian Wölckert, Claudia Linnhoff-Popien
Title: Evaluating Parameter-Based Training Performance of Neural Networks and Variational Quantum Circuits
Abstract:
In recent years, neural networks (NNs) have driven significant advances in machine learning. However, as tasks grow more complex, NNs often require large numbers of trainable parameters, which increases computational and energy demands. Variational quantum circuits (VQCs) offer a promising alternative: they leverage quantum mechanics to capture intricate relationships and typically need fewer parameters. In this work, we evaluate NNs and VQCs on simple supervised and reinforcement learning tasks, examining models with different parameter sizes. We simulate VQCs and execute selected parts of the training process on real quantum hardware to approximate actual training times. Our results show that VQCs can match NNs in performance while using significantly fewer parameters, despite longer training durations. As quantum technology and algorithms advance, and VQC architectures improve, we posit that VQCs could become advantageous for certain machine learning tasks.
中文: 变分量子电路(VQC)在参数远少于神经网络(NN)的情况下仍能实现相当的性能,尽管训练时间较长,但展现了其在未来机器学习应用中的潜力。
English: Variational quantum circuits (VQCs) demonstrate comparable performance to neural networks (NNs) with substantially fewer parameters, suggesting their potential for future machine learning applications despite longer training times.

Authors:Michael Kölle, Tom Bintener, Maximilian Zorn, Gerhard Stenzel, Leo Sünkel, Thomas Gabor, Claudia Linnhoff-Popien
Title: Evaluating Mutation Techniques in Genetic Algorithm-Based Quantum Circuit Synthesis
Abstract:
Quantum computing leverages the unique properties of qubits and quantum parallelism to solve problems intractable for classical systems, offering unparalleled computational potential. However, the optimization of quantum circuits remains critical, especially for noisy intermediate-scale quantum (NISQ) devices with limited qubits and high error rates. Genetic algorithms (GAs) provide a promising approach for efficient quantum circuit synthesis by automating optimization tasks. This work examines the impact of various mutation strategies within a GA framework for quantum circuit synthesis. By analyzing how different mutations transform circuits, it identifies strategies that enhance efficiency and performance. Experiments utilized a fitness function emphasizing fidelity, while accounting for circuit depth and T operations, to optimize circuits with four to six qubits. Comprehensive hyperparameter testing revealed that combining delete and swap strategies outperformed other approaches, demonstrating their effectiveness in developing robust GA-based quantum circuit optimizers.
中文: 本研究证明,在遗传算法框架中结合删除与交换变异策略,能显著提升NISQ设备量子电路合成的效率与性能,通过全面超参数测试验证其优越性。
English: This study demonstrates that combining delete and swap mutation strategies in a genetic algorithm framework significantly enhances the efficiency and performance of quantum circuit synthesis for NISQ devices, outperforming other approaches through comprehensive hyperparameter testing.
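
To make the two mutation operators concrete, here is a minimal sketch of delete and swap mutations on a circuit represented as a list of gate tuples. The representation (gate name followed by qubit indices) and the function names are assumptions for illustration; the abstract does not specify how the authors encode circuits or combine the operators.

```python
import random


def delete_mutation(circuit, rng):
    """Return a copy of the circuit with one randomly chosen gate removed."""
    if not circuit:
        return list(circuit)
    idx = rng.randrange(len(circuit))
    return circuit[:idx] + circuit[idx + 1:]


def swap_mutation(circuit, rng):
    """Return a copy of the circuit with two gate positions exchanged."""
    if len(circuit) < 2:
        return list(circuit)
    i, j = rng.sample(range(len(circuit)), 2)
    mutated = list(circuit)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def delete_then_swap(circuit, rng):
    """One hypothetical way to combine the two operators in a single step."""
    return swap_mutation(delete_mutation(circuit, rng), rng)


# Hypothetical encoding: (gate name, qubit indices...).
circuit = [("h", 0), ("cx", 0, 1), ("t", 1), ("h", 1)]
rng = random.Random(0)  # seeded so runs are reproducible
print(delete_then_swap(circuit, rng))
```

Delete shortens the circuit, which can lower depth and T count, while swap only reorders gates; a fitness function such as the one described above would then decide which mutated candidates survive.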

Authors:Leonardo Ranaldi, Federico Ranaldi, Fabio Massimo Zanzotto, Barry Haddow, Alexandra Birch
Title: Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations
Abstract:
Retrieval-augmented generation (RAG) is key to enhancing large language models (LLMs) to systematically access richer factual knowledge. Yet, using RAG brings intrinsic challenges, as LLMs must deal with potentially conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., a structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, DRAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. Through a series of in-depth experiments, we show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. The final results demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.
中文: 辩证检索增强生成(DRAG)通过论证性解释来批判性评估和解决多语言来源中的知识冲突,以较低计算成本显著提升回答准确性。
English: Dialectic-RAG (DRAG) enhances retrieval-augmented generation by using argumentative explanations to critically evaluate and resolve conflicting knowledge from multilingual sources, improving response accuracy with minimal computational cost.

Authors:Leonardo Ranaldi, Barry Haddow, Alexandra Birch
Title: Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task
Abstract:
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
中文: 本研究探讨了多语言检索增强生成在开放域问答中的应用,发现基于翻译的方法存在局限,并提出CrossRAG通过翻译检索文档来提升多语言任务性能。
English: This study explores multilingual retrieval-augmented generation (RAG) for open-domain question-answering, revealing limitations in translation-based methods and proposing CrossRAG, which translates retrieved documents to improve performance across languages.
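
The difference between the three strategies is where translation happens. The sketch below builds the context each strategy would hand to the generator, using a toy corpus, a word-overlap retriever, and a lookup table standing in for machine translation. All of these are assumptions for illustration, as is the choice to restrict tRAG to English documents; none of it is the authors' implementation.

```python
import re

# Toy multilingual corpus of (language, text) pairs, invented for this example.
CORPUS = [
    ("en", "Rome is the capital of Italy."),
    ("it", "Roma è la capitale d'Italia."),
    ("de", "Berlin ist die Hauptstadt von Deutschland."),
]

# Hypothetical stand-in for a machine translation system.
TO_ENGLISH = {
    "Roma è la capitale d'Italia.": "Rome is the capital of Italy.",
    "Berlin ist die Hauptstadt von Deutschland.": "Berlin is the capital of Germany.",
    "Qual è la capitale d'Italia?": "What is the capital of Italy?",
}


def translate_to_english(text):
    """Look up a translation; unknown text is returned unchanged."""
    return TO_ENGLISH.get(text, text)


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokens(doc[1])), reverse=True)
    return ranked[:top_k]


def trag_context(question, corpus, top_k=2):
    """tRAG: translate the question, then retrieve English documents only."""
    english_docs = [doc for doc in corpus if doc[0] == "en"]
    query = translate_to_english(question)
    return [text for _, text in retrieve(query, english_docs, top_k)]


def multirag_context(question, corpus, top_k=2):
    """MultiRAG: retrieve across all languages and pass documents as-is."""
    return [text for _, text in retrieve(question, corpus, top_k)]


def crossrag_context(question, corpus, top_k=2):
    """CrossRAG: retrieve across all languages, then translate the results."""
    return [translate_to_english(text) for _, text in retrieve(question, corpus, top_k)]


question = "Qual è la capitale d'Italia?"
print(multirag_context(question, CORPUS, top_k=1))
print(crossrag_context(question, CORPUS, top_k=1))
```

In this toy setup MultiRAG and CrossRAG retrieve the same document, but only CrossRAG presents it to the generator in a single common language.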

Authors:Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu
Title: Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
Abstract:
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.
Chinese: 该综述概述了将大语言模型的推理能力扩展到多模态领域的进展与挑战,强调需要复杂算法和评估方法来有效整合视觉与文本输入,并为未来研究指明了方向。
English: This survey outlines the advancements and challenges in extending large language models' reasoning capabilities to multimodal contexts, emphasizing the need for sophisticated algorithms and evaluation methods to integrate visual and textual inputs effectively.

Authors:Andrea E Davidson, Jessica M Ray, Yulia Levites Strekalova, Parisa Rashidi, Azra Bihorac
Title: Human-Centered Development of an Explainable AI Framework for Real-Time Surgical Risk Surveillance
Abstract:
Background: Artificial Intelligence (AI) clinical decision support (CDS) systems have the potential to augment surgical risk assessments, but successful adoption depends on an understanding of end-user needs and current workflows. This study reports the initial co-design of MySurgeryRisk, an AI CDS tool to predict the risk of nine post-operative complications in surgical patients. Methods: Semi-structured focus groups and interviews were held as co-design sessions with perioperative physicians at a tertiary academic hospital in the Southeastern United States. Participants were read a surgical vignette and asked questions to elicit an understanding of their current decision-making practices before being introduced to the MySurgeryRisk prototype web interface. They were asked to provide feedback on the user interface and system features. Session transcripts were qualitatively coded, after which thematic analysis took place. Results: Data saturation was reached after 20 surgeons and anesthesiologists from varying career stages participated across 11 co-design sessions. Thematic analysis resulted in five themes: (1) decision-making cognitive processes, (2) current approach to decision-making, (3) future approach to decision-making with MySurgeryRisk, (4) feedback on current MySurgeryRisk prototype, and (5) trustworthy considerations. Conclusion: Clinical providers perceived MySurgeryRisk as a promising CDS tool that factors in a large volume of data and is computed in real-time without any need for manual input. Participants provided feedback on the design of the interface and imagined applications of the tool in the clinical workflow. However, its successful implementation will depend on its actionability and explainability of model outputs, integration into current electronic systems, and calibration of trust among end-users.
中文: MySurgeryRisk作为一种人工智能临床决策支持工具,通过与围手术期医生共同设计以预测术后并发症,被认为因其实时数据处理能力而具有前景,但其成功实施依赖于输出的可操作性、可解释性以及与现有系统的无缝集成。
English: MySurgeryRisk, an AI clinical decision support tool, was co-designed with perioperative physicians to predict postoperative complications and was perceived as promising for its real-time data processing, though its implementation depends on actionability, explainability, and seamless integration into existing systems.

Authors:Chao Huang, Susan Liang, Yunlong Tang, Jing Bi, Li Ma, Yapeng Tian, Chenliang Xu
Title: FreSca: Scaling in Frequency Space Enhances Diffusion Models
Abstract:
Latent diffusion models (LDMs) have achieved remarkable success in a variety of image tasks, yet achieving fine-grained, disentangled control over global structures versus fine details remains challenging. This paper explores frequency-based control within latent diffusion models. We first systematically analyze frequency characteristics across pixel space, VAE latent space, and internal LDM representations. This reveals that the "noise difference" term, derived from classifier-free guidance at each step t, is a uniquely effective and semantically rich target for manipulation. Building on this insight, we introduce FreSca, a novel and plug-and-play framework that decomposes noise difference into low- and high-frequency components and applies independent scaling factors to them via spatial or energy-based cutoffs. Essentially, FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control. We demonstrate its versatility and effectiveness in improving generation quality and structural emphasis on multiple architectures (e.g., SD3, SDXL) and across applications including image generation, editing, depth estimation, and video synthesis, thereby unlocking a new dimension of expressive control within LDMs.
Chinese: 本文提出FreSca框架,通过独立调节噪声差异的低频与高频分量,实现了对隐扩散模型的细粒度控制,无需模型重训练即可提升多种应用的生成质量和结构表现。
English: This paper introduces FreSca, a plug-and-play framework that enables fine-grained control over latent diffusion models by independently scaling low- and high-frequency components of the noise difference, enhancing generation quality and structural emphasis across various applications without requiring model retraining.
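
The core operation, splitting a signal into low- and high-frequency parts and scaling each independently, can be sketched on a 1D signal with a plain discrete Fourier transform. This is a simplified illustration under stated assumptions: the real method operates on the noise difference inside a latent diffusion model, and the 1D setting, the index-based cutoff, and every name below are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def frequency_scale(noise_diff, low_scale, high_scale, cutoff):
    """Scale low- and high-frequency components of a 1D signal independently.

    Components whose symmetric frequency index is <= cutoff count as low
    frequency; all others count as high frequency.
    """
    n = len(noise_diff)
    scaled = []
    for k, coeff in enumerate(dft(noise_diff)):
        freq = min(k, n - k)  # 0 is the constant (DC) component
        scale = low_scale if freq <= cutoff else high_scale
        scaled.append(coeff * scale)
    return idft(scaled)


# Boost fine detail (high frequencies) while leaving coarse structure untouched.
signal = [1.0, -1.0, 1.0, -1.0]
print(frequency_scale(signal, low_scale=1.0, high_scale=2.0, cutoff=1))
```

Setting both scales to 1 returns the input unchanged, which matches the plug-and-play framing: the operation modifies an existing quantity at inference time and needs no retraining.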

Authors:Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei
Title: MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Abstract:
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
中文:MergeVQ通过将令牌合并技术与基于矢量量化的生成模型相结合,在统一架构中平衡了图像生成与视觉表征学习,在ImageNet上实现了优异的性能表现,同时保持了高效的令牌利用和推理速度。
English: MergeVQ integrates token merging with vector quantization in a unified architecture to enhance both image generation quality and representation learning efficiency, achieving competitive results on ImageNet while maintaining token efficiency and inference speed.

Authors:Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique
Title: Enabling Efficient Processing of Spiking Neural Networks with On-Chip Learning on Commodity Neuromorphic Processors for Edge AI Systems
Abstract:
The rising demand for energy-efficient edge AI systems (e.g., mobile agents/robots) has increased the interest in neuromorphic computing, since it offers ultra-low power/energy AI computation through spiking neural network (SNN) algorithms on neuromorphic processors. However, their efficient implementation strategy has not been comprehensively studied, hence limiting SNN deployments for edge AI systems. Toward this, we propose a design methodology to enable efficient SNN processing on commodity neuromorphic processors. To do this, we first study the key characteristics of targeted neuromorphic hardware (e.g., memory and compute budgets), and leverage this information to perform compatibility analysis for network selection. Afterward, we employ a mapping strategy for efficient SNN implementation on the targeted processor. Furthermore, we incorporate an efficient on-chip learning mechanism to update the systems' knowledge for adapting to new input classes and dynamic environments. The experimental results show that the proposed methodology leads the system to achieve low latency of inference (i.e., less than 50ms for image classification, less than 200ms for real-time object detection in video streaming, and less than 1ms in keyword recognition) and low latency of on-chip learning (i.e., less than 2ms for keyword recognition), while incurring less than 250mW of processing power and less than 15mJ of energy consumption across the respective different applications and scenarios. These results show the potential of the proposed methodology in enabling efficient edge AI systems for diverse application use-cases.
中文: 该设计方法实现了在神经形态处理器上高效运行脉冲神经网络,为多种边缘AI应用提供了低延迟和低功耗的解决方案。
English: The proposed design methodology enables efficient spiking neural network processing on neuromorphic processors, achieving low latency and power consumption for diverse edge AI applications.

Authors:Rachmad Vidya Wicaksana Putra, Saad Iftikhar, Muhammad Shafique
Title: QSViT: A Methodology for Quantizing Spiking Vision Transformers
Abstract:
Vision Transformer (ViT)-based models have shown state-of-the-art performance (e.g., accuracy) in vision-based AI tasks. However, realizing their capability in resource-constrained embedded AI systems is challenging due to their inherent large memory footprints and complex computations, thereby incurring high power/energy consumption. Recently, Spiking Vision Transformer (SViT)-based models have emerged as alternate low-power ViT networks. However, their large memory footprints still hinder their applicability for resource-constrained embedded AI systems. Therefore, there is a need for a methodology to compress SViT models without degrading the accuracy significantly. To address this, we propose QSViT, a novel design methodology to compress the SViT models through a systematic quantization strategy across different network layers. To do this, our QSViT employs several key steps: (1) investigating the impact of different precision levels in different network layers, (2) identifying the appropriate base quantization settings for guiding bit precision reduction, (3) performing a guided quantization strategy based on the base settings to select the appropriate quantization setting, and (4) developing an efficient quantized network based on the selected quantization setting. The experimental results demonstrate that our QSViT methodology achieves 22.75% memory saving and 21.33% power saving, while also maintaining high accuracy within 2.1% of that of the original non-quantized SViT model on the ImageNet dataset. These results highlight the potential of the QSViT methodology to pave the way toward efficient SViT deployments on resource-constrained embedded AI systems.
中文: QSViT方法通过系统化的量化策略压缩脉冲视觉变换器模型,在保持高精度的同时显著节省内存和功耗,使其适用于资源受限的嵌入式人工智能系统。
English: The QSViT methodology effectively compresses Spiking Vision Transformer models through systematic quantization, achieving significant memory and power savings while maintaining high accuracy, enabling their deployment in resource-constrained embedded AI systems.
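Illustration (not from the paper): the abstract describes assigning different precision levels to different layers and measuring the resulting memory saving. The sketch below shows that idea with plain symmetric uniform quantization, which is an assumption, since the abstract does not name the quantizer. The layer names, parameter counts, and bit widths are hypothetical and are not the paper's settings or results.

```python
import numpy as np


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # largest signed integer level
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0   # step size between levels
    levels = np.clip(np.round(weights / scale), -qmax, qmax)
    return levels * scale


def memory_bytes(layer_sizes, layer_bits):
    """Weight storage in bytes for a given per-layer bit-width assignment."""
    return sum(layer_sizes[name] * layer_bits[name] for name in layer_sizes) / 8


# Hypothetical layer names and parameter counts, for illustration only.
layer_sizes = {"attention": 4096, "mlp": 16384}
baseline = memory_bytes(layer_sizes, {"attention": 32, "mlp": 32})
mixed = memory_bytes(layer_sizes, {"attention": 16, "mlp": 8})
saving = 1.0 - mixed / baseline   # fraction of weight memory saved by the mixed setting
```

A guided search in the spirit of steps (1) to (4) would repeat this for several candidate bit-width assignments and keep the one whose accuracy drop stays within a chosen budget.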

Authors:Chongjie Si, Zhiyi Shi, Xuehui Wang, Yichen Xiao, Xiaokang Yang, Wei Shen
Title: Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
Abstract:
Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence. However, the wide range of tasks and high computational costs make full fine-tuning impractical. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. Despite the success of these methods, they are primarily designed for linear layers, focusing on two-dimensional matrices while largely ignoring higher-dimensional parameter spaces like convolutional kernels. Moreover, directly applying these methods to higher-dimensional parameter spaces often disrupts their structural relationships. Given the rapid advancements in matrix-based PEFT methods, rather than designing a specialized strategy, we propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties. Specifically, we treat parameters as elements of a Lie group, with updates modeled as perturbations in the corresponding Lie algebra. These perturbations are mapped back to the Lie group through the exponential map, ensuring smooth, consistent updates that preserve the inherent structure of the parameter space. Extensive experiments on computer vision and natural language processing validate the effectiveness and versatility of our approach, demonstrating clear improvements over existing methods.
中文摘要:本研究提出了一种广义参数高效微调方法,通过李群理论将基于矩阵的方法扩展到高维参数空间,在计算机视觉和自然语言处理任务中实现了优于现有方法的性能。
English Summary: The study introduces a generalized parameter-efficient fine-tuning method that extends matrix-based approaches to higher-dimensional parameter spaces using Lie group theory, achieving superior performance in computer vision and natural language processing tasks.
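Illustration (not from the paper): the abstract says that updates are perturbations in a Lie algebra that are mapped back to the Lie group through the exponential map. One simple instantiation, assumed here and not confirmed by the abstract, is element-wise multiplication as the group operation, so that the update becomes W * exp(dW) with dW given by a LoRA-style low-rank product. The names `lie_group_update`, `A`, and `B` are hypothetical.

```python
import numpy as np


def lie_group_update(weight, A, B):
    """Multiplicative update W * exp(A @ B), keeping the tensor shape of `weight`."""
    delta = (A @ B).reshape(weight.shape)  # low-rank perturbation (Lie algebra element)
    return weight * np.exp(delta)          # element-wise exponential map back to the group


rng = np.random.default_rng(0)
kernel = rng.normal(size=(8, 4, 3, 3))     # a convolutional kernel: out, in, height, width
rank = 2
A = rng.normal(scale=0.01, size=(8, rank))
B = np.zeros((rank, 4 * 3 * 3))            # zero init, so training starts from the pre-trained weights

unchanged = lie_group_update(kernel, A, B)
B_trained = rng.normal(scale=0.01, size=B.shape)  # stand-in for a learned value
updated = lie_group_update(kernel, A, B_trained)
```

Under this assumed group, the update never flips the sign of a weight and leaves the four-dimensional kernel shape intact, which is one concrete reading of "preserving the structure of the parameter space".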

Authors:Yuang Jia, Xiaojuan Shan, Jun Xia, Guancheng Wan, Yuchen Zhang, Wenke Huang, Mang Ye, Stan Z. Li
Title: Adversarial Curriculum Graph-Free Knowledge Distillation for Graph Neural Networks
Abstract:
Data-free Knowledge Distillation (DFKD) is a method that constructs pseudo-samples using a generator without real data, and transfers knowledge from a teacher model to a student by enforcing the student to overcome dimensional differences and learn to mimic the teacher's outputs on these pseudo-samples. In recent years, various studies in the vision domain have made notable advancements in this area. However, the varying topological structures and non-grid nature of graph data render the methods from the vision domain ineffective. Building upon prior research into differentiable methods for graph neural networks, we propose a fast and high-quality data-free knowledge distillation approach in this paper. Without compromising distillation quality, the proposed graph-free KD method (ACGKD) significantly reduces the spatial complexity of pseudo-graphs by leveraging the Binary Concrete distribution to model the graph structure and introducing a spatial complexity tuning parameter. This approach enables efficient gradient computation for the graph structure, thereby accelerating the overall distillation process. Additionally, ACGKD eliminates the dimensional ambiguity between the student and teacher models by increasing the student's dimensions and reusing the teacher's classifier. Moreover, it equips graph knowledge distillation with a CL-based strategy to ensure the student learns graph structures progressively. Extensive experiments demonstrate that ACGKD achieves state-of-the-art performance in distilling knowledge from GNNs without training data.
中文: 提出的ACGKD方法通过二元具体分布建模降低空间复杂度并消除维度模糊性,实现了高效的图神经网络无数据知识蒸馏,在无需训练数据的情况下达到了最先进的性能。
English: The proposed ACGKD method enables efficient data-free knowledge distillation for graph neural networks by reducing spatial complexity through binary concrete distribution modeling and eliminating dimensional ambiguity, while achieving state-of-the-art performance without training data.
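Illustration (not from the paper): the abstract models the pseudo-graph structure with the Binary Concrete distribution so that gradients can flow to the structure. The sketch below shows the standard Binary Concrete (relaxed Bernoulli) sampler applied to an adjacency matrix. The function names, the symmetrization step, and the 5-node example are assumptions, and the spatial complexity tuning parameter mentioned in the abstract is not modeled.

```python
import numpy as np


def sample_binary_concrete(logits, temperature, rng):
    """Draw relaxed Bernoulli samples in [0, 1]; the expression is smooth in `logits`."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))


def sample_soft_adjacency(edge_logits, temperature, rng):
    """Relaxed adjacency matrix of an undirected pseudo-graph without self-loops."""
    soft = sample_binary_concrete(edge_logits, temperature, rng)
    upper = np.triu(soft, k=1)       # keep one sample per node pair
    return upper + upper.T           # mirror it to make the matrix symmetric


rng = np.random.default_rng(0)
edge_logits = np.zeros((5, 5))       # hypothetical learnable parameters, one per node pair
adjacency = sample_soft_adjacency(edge_logits, temperature=0.5, rng=rng)
```

Lower temperatures push the samples toward 0 or 1, which approximates a discrete graph while keeping the sampling step usable inside gradient-based training.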

Authors:Linshan Wu, Yuxiang Nie, Sunan He, Jiaxin Zhuang, Luyang Luo, Neeraj Mahboobani, Varut Vardhanabhuti, Ronald Cheong Kin Chan, Yifan Peng, Pranav Rajpurkar, Hao Chen
Title: UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
Abstract:
The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate AI-generated findings with visual evidence (e.g., tiny lesions) in images and interpret the results of AI models. To address this challenge, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation, which is capable of generating accurate diagnostic findings and simultaneously segmenting the corresponding biomedical targets. UniBiomed is based on a novel integration of Multi-modal Large Language Model and Segment Anything Model, which can effectively unify diverse biomedical tasks in universal training for advancing grounded interpretation. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, region annotations, and text descriptions across ten biomedical imaging modalities. Extensive validation on 70 internal and 14 external datasets demonstrated the state-of-the-art performance of UniBiomed in diverse biomedical tasks, including image segmentation, disease recognition, region-aware diagnosis, vision question answering, and report generation. In summary, UniBiomed is a powerful and versatile biomedical foundation model, unlocking the untapped grounded interpretation capability for optimizing AI-assisted biomedical image analysis.
中文:UniBiomed作为首个通用基础模型,通过整合多模态大语言模型与分割技术,能同时生成精确诊断结果并定位相关生物医学目标,有效解决了临床AI应用中解释性不足的难题。
English: UniBiomed is a universal foundation model that integrates multi-modal large language and segmentation capabilities to generate accurate diagnostic findings while simultaneously localizing corresponding biomedical objects, addressing the interpretability gap in clinical AI applications.

Authors:Minghui Xu, Wenxuan Yu, Guangyong Shang, Guangpeng Qi, Dongliang Duan, Shan Wang, Kun Li, Yue Zhang, Xiuzhen Cheng
Title: Starfish: Rebalancing Multi-Party Off-Chain Payment Channels
Abstract:
Blockchain technology has revolutionized the way transactions are executed, but scalability remains a major challenge. Payment Channel Network (PCN), as a Layer-2 scaling solution, has been proposed to address this issue. However, skewed payments can deplete the balance of one party within a channel, restricting the ability of PCNs to transact through a path and subsequently reducing the transaction success rate. To address this issue, the technology of rebalancing has been proposed. However, existing rebalancing strategies in PCNs are limited in their capacity and efficiency. Cycle-based approaches only address rebalancing within groups of nodes that form a cycle network, while non-cycle-based approaches face high complexity of on-chain operations and limitations on rebalancing capacity. In this study, we propose Starfish, a rebalancing approach that captures the star-shaped network structure to provide high rebalancing efficiency and large channel capacity. Starfish requires only $N$-time on-chain operations to connect independent channels and aggregate the total budget of all channels. To demonstrate the correctness and advantages of our method, we provide a formal security proof of the Starfish protocol and conduct comparative experiments with existing rebalancing techniques.
中文摘要:本研究提出的Starfish再平衡方法利用星型网络结构提升支付通道网络的效率与容量,仅需N次链上操作即可整合通道资源,并通过形式化安全证明与对比实验验证了其相较于现有技术的优越性。
English Summary: The proposed Starfish rebalancing approach exploits a star-shaped network structure to improve rebalancing efficiency and channel capacity in Payment Channel Networks. It needs only $N$ on-chain operations to connect independent channels and aggregate their budgets, and it is supported by a formal security proof and comparative experiments against existing rebalancing techniques.

Authors:Floriane Magera, Thomas Hoyoux, Martin Castin, Olivier Barnich, Anthony Cioppa, Marc Van Droogenbroeck
Title: Can Geometry Save Central Views for Sports Field Registration?
Abstract:
Single-frame sports field registration often serves as the foundation for extracting 3D information from broadcast videos, enabling applications related to sports analytics, refereeing, or fan engagement. As sports fields have rigorous specifications in terms of shape and dimensions of their line, circle and point components, sports field markings are commonly used as calibration targets for this task. However, because of the sparse and uneven distribution of field markings, close-up camera views around central areas of the field often depict only line and circle markings. On these views, sports field registration is challenging for the vast majority of existing methods, as they focus on leveraging line field markings and their intersections. It is indeed a challenge to include circle correspondences in a set of linear equations. In this work, we propose a novel method to derive a set of points and lines from circle correspondences, enabling the exploitation of circle correspondences for both sports field registration and image annotation. In our experiments, we illustrate the benefits of our bottom-up geometric method against top-performing detectors and show that our method successfully complements them, enabling sports field registration in difficult scenarios.
中文摘要:本文提出一种创新方法,将圆形对应关系转化为点和线,以改进在困难近景视角下的运动场配准,有效补充了现有技术。
English Summary: This paper introduces a novel method that converts circle correspondences into points and lines to enhance sports field registration in challenging close-up views, effectively complementing existing techniques.

Authors:Qidong Liu, Xiangyu Zhao, Yejing Wang, Zijian Zhang, Howard Zhong, Chong Chen, Xiang Li, Wei Huang, Feng Tian
Title: Bridge the Domains: Large Language Models Enhanced Cross-domain Sequential Recommendation
Abstract:
Cross-domain Sequential Recommendation (CDSR) aims to extract user preferences from historical interactions across various domains. Despite some progress in CDSR, two problems hinder further advancement: the overlap dilemma and transition complexity. The former means that existing CDSR methods rely heavily on users who have interactions in all domains to learn cross-domain item relationships, which compromises practicality. The latter refers to the difficulty of learning complex transition patterns from mixed behavior sequences. With powerful representation and reasoning abilities, Large Language Models (LLMs) are promising for addressing these two problems by bridging the items and capturing the user's preferences from a semantic view. Therefore, we propose an LLMs Enhanced Cross-domain Sequential Recommendation model (LLM4CDSR). To obtain the semantic item relationships, we first propose an LLM-based unified representation module to represent items. Then, a trainable adapter with contrastive regularization is designed to adapt to the CDSR task. In addition, a hierarchical LLMs profiling module is designed to summarize user cross-domain preferences. Finally, these two modules are integrated into the proposed tri-thread framework to derive recommendations. We have conducted extensive experiments on three public cross-domain datasets, validating the effectiveness of LLM4CDSR. We have released the code online.
中文:LLM4CDSR模型利用大语言模型,通过语义化表示物品和分层分析用户偏好,解决了跨域序列推荐中的重叠依赖和转移复杂性难题,实验验证了其有效性。
English: The LLM4CDSR model leverages large language models to address the overlap dilemma and transition complexity in cross-domain sequential recommendation by representing items semantically and profiling user preferences hierarchically, demonstrating effectiveness in experiments.

Authors:Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya Patil
Title: Hierarchical and Multimodal Data for Daily Activity Understanding
Abstract:
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allows counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: https://alregib.ece.gatech.edu/software-and-datasets/darai-daily-activity-recordings-for-artificial-intelligence-and-machine-learning/
中文: DARai是一个多模态数据集,包含来自10个环境中50名参与者的200多小时分层标注活动记录,通过20种传感器和反事实活动设计推动以人为中心的AI研究。
English: DARai is a multimodal dataset featuring over 200 hours of hierarchically annotated human activity recordings from 20 sensors across 10 environments, designed to advance AI applications through sensor fusion and domain-variant experiments.

Authors:Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yinchuan Li, Yingcong Chen
Title: DiMeR: Disentangled Mesh Reconstruction Model
Abstract:
We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods confront two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even with wrong geometry, producing multiple ambiguous optimization objectives in geometry-texture mixed solution space for similar objects; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision. To solve these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, where a single input admits multiple feasible solutions, into geometry and texture spaces individually. Specifically, given that normal maps are strictly consistent with geometry and accurately capture surface variations, the normal maps serve as the sole input for geometry prediction in DiMeR, while the texture is estimated from RGB images. Second, we streamline the algorithm of mesh extraction by eliminating modules with low performance/cost ratios and redesigning regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalises across sparse-view-, single-image-, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR significantly reduces Chamfer Distance by more than 30%.
中文: DiMeR提出了一种具有三维监督的几何-纹理解耦模型,解决了稀疏视图重建中的优化模糊和网格提取低效问题,在基准数据集上将倒角距离显著降低了30%以上。
English: DiMeR introduces a geometry-texture disentangled model with 3D supervision to address ambiguous optimization and inefficient mesh extraction in sparse-view reconstruction, reducing Chamfer Distance by more than 30% on the GSO and OmniObject3D datasets.

Authors:Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Title: BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
Abstract:
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.
中文摘要:BadVideo是首个针对文本到视频生成模型的后门攻击框架,通过时空组合和动态元素转换利用模型固有冗余嵌入恶意内容,在保持正常生成质量的同时成功规避内容审核系统。
English Summary: BadVideo is the first backdoor attack framework exploiting the inherent redundancy in text-to-video models to embed hidden malicious content through spatio-temporal manipulation, effectively bypassing content moderation while maintaining output quality.

Authors:Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua
Title: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Abstract:
Large Multimodal Models (LMMs) uniformly perceive video frames, creating computational inefficiency for videos with inherently varying temporal information density. This paper presents \textbf{Quicksviewer}, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax, followed by a unified resampling for each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (overall 45$\times$ compression rate), while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using just up to 5\% of tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law of the model capabilities. It is also empirically verified that the segments generated by the cubing network can help analyze continuous events in videos.
中文: Quicksviewer采用基于Gumbel Softmax的动态视频分区方法,根据时间密度生成不同立方体,实现45倍压缩率,以少量训练数据获得卓越性能,并能高效分析视频中的连续事件。
English: Quicksviewer introduces a dynamic video partitioning method using Gumbel Softmax to create varying cubes based on temporal density, achieving 45× compression and superior performance with minimal training data while enabling efficient analysis of continuous events.
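Illustration (not from the paper): the abstract partitions a video into variable-length cubes using Gumbel Softmax. The sketch below shows the standard Gumbel-Softmax relaxation and one hypothetical way to turn per-frame "start a new cube" decisions into cubes. The two-class boundary formulation, the function names, and the random logits are assumptions. The paper's cubing network and the per-cube resampling are not reproduced.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Relaxed categorical sample: softmax of (logits + Gumbel noise) / temperature."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    scores = (logits + gumbel_noise) / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


def partition_into_cubes(boundary_logits, temperature, rng):
    """Group frame indices into contiguous cubes; class 1 means 'start a new cube'."""
    probs = gumbel_softmax(boundary_logits, temperature, rng)
    starts_new_cube = probs.argmax(axis=-1) == 1
    cubes, current = [], [0]             # frame 0 always opens the first cube
    for t in range(1, len(starts_new_cube)):
        if starts_new_cube[t]:
            cubes.append(current)
            current = []
        current.append(t)
    cubes.append(current)
    return cubes


rng = np.random.default_rng(0)
num_frames = 12
boundary_logits = rng.normal(size=(num_frames, 2))   # hypothetical per-frame logits
sample = gumbel_softmax(boundary_logits, temperature=1.0, rng=rng)
cubes = partition_into_cubes(boundary_logits, temperature=1.0, rng=rng)
```

Frames with little change would ideally fall into long cubes and dense segments into short ones, so a fixed token budget per cube yields a higher compression rate where the video is redundant.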

Authors:Yunpu Zhao, Rui Zhang, Junbin Xiao, Ruibo Hou, Jiaming Guo, Zihao Zhang, Yifan Hao, Yunji Chen
Title: Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation
Abstract:
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration, resulting in misalignment between their verbalized confidence and response correctness. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. In this work, we propose a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries. We first introduce a perturbed dataset where Gaussian noise is applied to the key object regions to simulate visual uncertainty at different confidence levels, establishing an explicit mapping between visual ambiguity and confidence levels. We further enhance calibration through a two-stage training process combining supervised fine-tuning on the perturbed dataset with subsequent preference optimization. Extensive experiments on popular benchmarks demonstrate that our method significantly improves the alignment between verbalized confidence and response correctness while maintaining or enhancing overall task performance. These results highlight the potential of semantic perturbation as a practical tool for improving the reliability and interpretability of VLMs.
中文摘要:本研究提出的语义扰动置信度校准(CSP)框架通过高斯噪声扰动模拟视觉不确定性,结合两阶段训练显著提升了视觉语言模型的置信度校准效果,在保持性能的同时增强了模型的可靠性和可解释性。
English Summary: The proposed Confidence Calibration through Semantic Perturbation (CSP) framework effectively improves vision-language models' calibration by mapping visual uncertainty to confidence levels through Gaussian noise perturbation and a two-stage training process, enhancing reliability without compromising performance.
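Illustration (not from the paper): the abstract applies Gaussian noise to key object regions at different strengths and ties each strength to a confidence level. The sketch below perturbs only a bounding-box region and pairs each noise level with a confidence label. The box format, the noise levels, and the label names are hypothetical and are not taken from the paper.

```python
import numpy as np


def perturb_object_region(image, box, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` inside `box` only."""
    x0, y0, x1, y1 = box
    noisy = image.astype(np.float64)        # astype returns a copy, the input is not modified
    region = noisy[y0:y1, x0:x1]            # view into `noisy`
    region += rng.normal(0.0, sigma, size=region.shape)
    return np.clip(noisy, 0.0, 255.0)


# Hypothetical mapping from noise strength to the confidence the model should verbalize.
NOISE_TO_CONFIDENCE = {0.0: "high", 40.0: "medium", 120.0: "low"}

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
box = (8, 8, 24, 24)                        # x0, y0, x1, y1 of the key object
dataset = [
    (perturb_object_region(image, box, sigma, rng), label)
    for sigma, label in NOISE_TO_CONFIDENCE.items()
]
```

Each (image, label) pair could then serve as a supervised fine-tuning target, with the stronger perturbations teaching the model to express lower confidence about the obscured object.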

Authors:Xixi Wan, Aihua Zheng, Zi Wang, Bo Jiang, Jin Tang, Jixin Ma
Title: Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning
Abstract:
Multi-modal data provides abundant and diverse object information, crucial for effective modal interactions in Re-Identification (ReID) tasks. However, existing approaches often overlook the quality variations in local features and fail to fully leverage the complementary information across modalities, particularly in the case of low-quality features. In this paper, we propose to address this issue by leveraging a novel graph reasoning model, termed the Modality-aware Graph Reasoning Network (MGRNet). Specifically, we first construct modality-aware graphs to enhance the extraction of fine-grained local details by effectively capturing and modeling the relationships between patches. Subsequently, the selective graph nodes swap operation is employed to alleviate the adverse effects of low-quality local features by considering both local and global information, enhancing the representation of discriminative information. Finally, the swapped modality-aware graphs are fed into the local-aware graph reasoning module, which propagates multi-modal information to yield a reliable feature representation. Another advantage of the proposed graph reasoning approach is its ability to reconstruct missing modal information by exploiting inherent structural relationships, thereby minimizing disparities between different modalities. Experimental results on four benchmarks (RGBNT201, Market1501-MM, RGBNT100, MSVR310) indicate that the proposed method achieves state-of-the-art performance in multi-modal object ReID. The code for our method will be available upon acceptance.
中文摘要:本研究提出的模态感知图推理网络(MGRNet)通过构建模态感知图来提升局部特征质量,利用图节点交换和跨模态信息传播解决多模态ReID中的特征互补问题,在多个基准测试中实现了最优性能。
English Summary: The proposed Modality-aware Graph Reasoning Network (MGRNet) addresses limitations in multi-modal ReID by enhancing local feature quality through graph-based modeling and cross-modal information exchange, achieving state-of-the-art performance on multiple benchmarks.

Authors:Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang
Title: Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Abstract:
Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
中文: Video-MMLU作为评估语言多模态模型理解跨学科讲座能力的综合基准,揭示了现有模型在感知推理任务上的不足,并探讨了视觉标记数量与模型规模对性能的影响。
English: Video-MMLU is a comprehensive benchmark assessing language multimodal models' ability to comprehend multi-discipline lectures, revealing current models' limitations in perception-reasoning tasks and examining the impact of visual tokens and model scale.

Authors:Saeid Jamshidi, Amin Nikanjam, Kawser Wazed Nafi, Foutse Khomh, Rasoul Rasta
Title: Application of Deep Reinforcement Learning for Intrusion Detection in Internet of Things: A Systematic Review
Abstract:
The Internet of Things (IoT) has significantly expanded the digital landscape, interconnecting an unprecedented array of devices, from home appliances to industrial equipment. This growth enhances functionality, e.g., automation, remote monitoring, and control, and introduces substantial security challenges, especially in defending these devices against cyber threats. Intrusion Detection Systems (IDS) are crucial for securing IoT; however, traditional IDS often struggle to adapt to IoT networks' dynamic and evolving nature and threat patterns. A potential solution is using Deep Reinforcement Learning (DRL) to enhance IDS adaptability, enabling them to learn from and react to their operational environment dynamically. This systematic review examines the application of DRL to enhance IDS in IoT settings, covering research from the past ten years. This review underscores the state-of-the-art DRL techniques employed to improve adaptive threat detection and real-time security across IoT domains by analyzing various studies. Our findings demonstrate that DRL significantly enhances IDS capabilities by enabling systems to learn and adapt from their operational environment. This adaptability allows IDS to improve threat detection accuracy and minimize false positives, making it more effective in identifying genuine threats while reducing unnecessary alerts. Additionally, this systematic review identifies critical research gaps and future research directions, emphasizing the necessity for more diverse datasets, enhanced reproducibility, and improved integration with emerging IoT technologies. This review aims to foster the development of dynamic and adaptive IDS solutions essential for protecting IoT networks against sophisticated cyber threats.
中文: 本系统综述探讨了深度强化学习(DRL)如何提升物联网(IoT)环境中入侵检测系统(IDS)的性能,表明DRL能增强适应性、提高威胁检测精度并减少误报,同时指出了未来研究的关键方向。
English: This systematic review explores how Deep Reinforcement Learning (DRL) enhances Intrusion Detection Systems (IDS) in IoT environments, showing that DRL improves adaptability, threat detection accuracy, and reduces false positives, while also identifying research gaps for future development.

Authors:Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, Sheng Zhong
Title: Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask
Abstract:
Large Language Models are a promising tool for automated vulnerability detection, thanks to their success in code generation and repair. However, despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? Current evaluations, which often assess models on isolated functions or files, ignore the broader execution and data-flow context essential for understanding vulnerabilities. This oversight leads to two types of misleading outcomes: incorrect conclusions and flawed rationales, collectively undermining the reliability of prior assessments. Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations. To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically incorporates contextual information into LLM-based vulnerability detection. We construct a context-rich dataset of 2,000 vulnerable-patched program pairs spanning 99 CWEs and evaluate 13 LLMs across four model families. Our framework elicits both binary predictions and natural-language rationales, which are further validated using LLM-as-a-judge techniques. Our findings overturn existing misconceptions. When provided with sufficient context, SOTA LLMs achieve significantly improved performance (e.g., 0.7 F1-score on key CWEs), with 0.8 precision. We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
中文: 大语言模型在漏洞检测方面具有巨大潜力,当采用包含上下文信息的CORRECT评估框架时,其性能显著提升,推翻了先前关于模型不可靠和性能受限的误解。
English: Large Language Models show significant potential for vulnerability detection when evaluated with proper contextual information, as demonstrated by the CORRECT framework, which overturns previous misconceptions about their unreliability and performance limitations.
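As a rough sanity check on the figures quoted in the abstract above, recall can be back-computed from the F1-score and precision. This is only a sketch: it assumes the 0.7 F1 and 0.8 precision refer to the same evaluation setting, which the abstract does not confirm, and the function name is illustrative.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and precision P."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for these inputs")
    return f1 * precision / denom


# Assumption: both numbers come from the same evaluation setting.
recall = recall_from_f1(0.7, 0.8)
print(round(recall, 3))  # prints 0.622
```

Under that assumption, the implied recall of roughly 0.62 is consistent with the abstract's remark that scaling introduces trade-offs in recall.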

Authors:Chen Wang, Fei Xia, Wenhao Yu, Tingnan Zhang, Ruohan Zhang, C. Karen Liu, Li Fei-Fei, Jie Tan, Jacky Liang
Title: Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models
Abstract:
Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data -- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io
中文摘要:本研究提出Chain-of-Modality(CoM)提示策略,通过融合人类演示中的视频与肌肉/音频信号,使机器人能够从多模态数据中提取任务规划和控制参数,在实验中相比基线方法实现了三倍的精度提升。
English Summary: This study introduces Chain-of-Modality (CoM), a prompting strategy that enables robots to learn manipulation tasks from multimodal human demonstrations—combining video with muscle or audio signals—to extract detailed task plans and control parameters, achieving a threefold accuracy improvement over baseline methods.

Authors:Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, Zhaopeng Cui
Title: HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
Abstract:
Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
中文摘要:HiScene提出了一种层次化框架,通过将场景视为等距视图下的分层对象,结合视频扩散的不可见补全技术和形状先验注入,实现了具备组合结构的高保真3D场景生成。
English Summary: HiScene introduces a hierarchical framework that bridges 2D image and 3D object generation, enabling high-fidelity scene creation with compositional structure through video-diffusion-based amodal completion and shape prior injection.

Authors:Andrea Piergentili, Beatrice Savoldi, Matteo Negri, Luisa Bentivogli
Title: An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Abstract:
Gender-neutral translation (GNT) aims to avoid expressing the gender of human referents when the source text lacks explicit cues about the gender of those referents. Evaluating GNT automatically is particularly challenging, with current solutions being limited to monolingual classifiers. Such solutions are not ideal because they do not factor in the source sentence and require dedicated data and fine-tuning to scale to new languages. In this work, we address such limitations by investigating the use of large language models (LLMs) as evaluators of GNT. Specifically, we explore two prompting approaches: one in which LLMs generate sentence-level assessments only, and another, akin to a chain-of-thought approach, where they first produce detailed phrase-level annotations before a sentence-level judgment. Through extensive experiments on multiple languages with five models, both open and proprietary, we show that LLMs can serve as evaluators of GNT. Moreover, we find that prompting for phrase-level annotations before sentence-level assessments consistently improves the accuracy of all models, providing a better and more scalable alternative to current solutions.
中文: 本研究证明,通过采用提示策略——尤其在句子级判断前加入短语级注释——大型语言模型能够有效评估性别中立翻译,为多语言场景提供了比现有方案更优且可扩展的解决方案。
English: This study demonstrates that large language models can effectively evaluate gender-neutral translation by using prompt strategies, particularly when incorporating phrase-level annotations before sentence-level judgments to enhance accuracy across multiple languages.

Authors:Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang
Title: Efficient Reasoning Models: A Survey
Abstract:
Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.
中文摘要:推理模型通过长思维链实现高精度但带来巨大计算开销,本综述将加速方法归纳为缩短推理链、缩小模型规模及优化解码策略三大方向。
English Summary: Reasoning models achieve high accuracy through extended Chain-of-Thoughts but incur significant computational costs, prompting this survey to categorize acceleration methods into shorter reasoning chains, smaller models, and faster decoding strategies.

Authors:Marco Salmè, Lorenzo Tronchin, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Title: Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains
Abstract:
Data scarcity remains a critical bottleneck impeding technological advancements across various domains, including but not limited to medicine and precision agriculture. To address this challenge, we explore the potential of Deep Generative Models (DGMs) in producing synthetic data that satisfies the Generative Learning Trilemma: fidelity, diversity, and sampling efficiency. However, recognizing that these criteria alone are insufficient for practical applications, we extend the trilemma to include utility, robustness, and privacy, factors crucial for ensuring the applicability of DGMs in real-world scenarios. Evaluating these metrics becomes particularly challenging in data-scarce environments, as DGMs traditionally rely on large datasets to perform optimally. This limitation is especially pronounced in domains like medicine and precision agriculture, where ensuring acceptable model performance under data constraints is vital. To address these challenges, we assess the Generative Learning Trilemma in data-scarcity settings using state-of-the-art evaluation metrics, comparing three prominent DGMs: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). Furthermore, we propose a comprehensive framework to assess utility, robustness, and privacy in synthetic data generated by DGMs. Our findings demonstrate varying strengths among DGMs, with each model exhibiting unique advantages based on the application context. This study broadens the scope of the Generative Learning Trilemma, aligning it with real-world demands and providing actionable guidance for selecting DGMs tailored to specific applications.
中文: 本研究通过将生成学习三难困境扩展至实用性、鲁棒性和隐私性,评估了三种深度生成模型,为解决数据稀缺问题提供了针对实际应用的指导。
English: This study addresses data scarcity by extending the Generative Learning Trilemma to include utility, robustness, and privacy, evaluating three deep generative models to provide practical guidance for real-world applications.

Authors:Jiajie Su, Qiyong Zhong, Yunshan Ma, Weiming Liu, Chaochao Chen, Xiaolin Zheng, Jianwei Yin, Tat-Seng Chua
Title: Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation
Abstract:
Session-based recommendation (SBR) predicts the next item based on anonymous sessions. Traditional SBR explores user intents based on ID collaborations or auxiliary content. To further alleviate data sparsity and cold-start issues, recent Multimodal SBR (MSBR) methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness. Considering the semantic reasoning abilities of Large Language Models (LLMs), we focus on the LLM-enhanced MSBR scenario in this paper, which leverages LLM cognition for comprehensive multimodal representation generation to enhance downstream MSBR. Tackling this problem poses two challenges: i) how to obtain LLM cognition on both transitional patterns and inherent multimodal knowledge, and ii) how to align both features into one unified LLM, minimizing discrepancy while maximizing representation utility. To this end, we propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR. TPAD establishes parallel Knowledge-MLLM and Transfer-MLLM, where the former interprets item knowledge-reflected features and the latter extracts transition-aware features underneath sessions. A transitional pattern alignment module harnessing mutual information estimation theory unites the two MLLMs, alleviating distribution discrepancy and distilling transitional patterns into modal representations. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework.
中文: 本文提出TPAD框架,通过并行多模态大语言模型和互信息估计,解耦并对齐会话推荐中的转换模式与多模态知识,有效缓解数据稀疏性问题。
English: This paper introduces TPAD, a multimodal LLM-enhanced framework that addresses data sparsity in session-based recommendations by decoupling and aligning transitional patterns with multimodal knowledge through parallel MLLMs and mutual information estimation.

Authors:Saeid Jamshidi, Kawser Wazed Nafi, Amin Nikanjam, Foutse Khomh
Title: Evaluating Machine Learning-Driven Intrusion Detection Systems in IoT: Performance and Energy Consumption
Abstract:
In the evolving landscape of the Internet of Things (IoT), Machine Learning (ML)-based Intrusion Detection Systems (IDS) represent a significant advancement, especially when integrated with Software-Defined Networking (SDN). These systems play a critical role in enhancing security infrastructure within resource-constrained IoT systems. Despite their growing adoption, limited research has explored the impact of ML-based IDS on key performance metrics, such as CPU load, CPU usage, and energy consumption, particularly under real-time cyber threats. This study bridges that gap through an empirical evaluation of cutting-edge ML-based IDSs deployed at the edge of IoT networks under both benign and attack scenarios. Additionally, we investigate how SDN's centralized control and dynamic resource management influence IDS performance. Our experimental framework compares traditional ML-based IDS with deep learning (DL)-based counterparts, both with and without SDN integration. Results reveal that edge-deployed ML-based IDSs significantly impact system performance during cyber threats, with marked increases in resource consumption. SDN integration further influences these outcomes, emphasizing the need for optimized architectural design. Statistical analysis using ANOVA confirms the significance of our findings. This research provides critical insights into the performance and trade-offs of deploying ML-based IDSs in edge-based IoT systems.
中文: 本研究实证评估了基于机器学习的入侵检测系统在边缘物联网网络中的性能,揭示了网络威胁下显著的资源消耗,并展示了软件定义网络集成对这些结果的影响,从而强调了优化架构设计的必要性。
English: This study empirically evaluates the performance of ML-based Intrusion Detection Systems in edge IoT networks, revealing significant resource consumption during cyber threats and demonstrating how SDN integration affects these outcomes, thereby highlighting the need for optimized architectural design.

Authors:Congcong Wen, Geeta Chandra Raju Bethala, Yu Hao, Niraj Pudasaini, Hao Huang, Shuaihang Yuan, Baoru Huang, Anh Nguyen, Mengyu Wang, Anthony Tzes, Yi Fang
Title: Humanoid Agent via Embodied Chain-of-Action Reasoning with Multimodal Foundation Models for Zero-Shot Loco-Manipulation
Abstract:
Humanoid loco-manipulation, which integrates whole-body locomotion with dexterous manipulation, remains a fundamental challenge in robotics. Beyond whole-body coordination and balance, a central difficulty lies in understanding human instructions and translating them into coherent sequences of embodied actions. Recent advances in foundation models provide transferable multimodal representations and reasoning capabilities, yet existing efforts remain largely restricted to either locomotion or manipulation in isolation, with limited applicability to humanoid settings. In this paper, we propose Humanoid-COA, the first humanoid agent framework that integrates foundation model reasoning with an Embodied Chain-of-Action (CoA) mechanism for zero-shot loco-manipulation. Within the perception--reasoning--action paradigm, our key contribution lies in the reasoning stage, where the proposed CoA mechanism decomposes high-level human instructions into structured sequences of locomotion and manipulation primitives through affordance analysis, spatial inference, and whole-body action reasoning. Extensive experiments on two humanoid robots, Unitree H1-2 and G1, in both an open test area and an apartment environment, demonstrate that our framework substantially outperforms prior baselines across manipulation, locomotion, and loco-manipulation tasks, achieving robust generalization to long-horizon and unstructured scenarios. Project page: https://humanoid-coa.github.io/
中文摘要:本文提出Humanoid-COA框架,通过结合基础模型推理与具身行动链机制,首次实现人形机器人对高级人类指令的零样本全身运动与操作,在复杂场景中展现出优越的泛化能力。
English Summary: This paper introduces Humanoid-COA, a novel framework that combines foundation model reasoning with an Embodied Chain-of-Action mechanism to enable humanoid robots to perform zero-shot integrated locomotion and manipulation by decomposing human instructions into structured action sequences.

Authors:Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, Jian Guo
Title: SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Abstract:
Natural Language to SQL (NL2SQL) enables intuitive interactions with databases by transforming natural language queries into structured SQL statements. Despite recent advancements in enhancing human-computer interaction within database applications, significant challenges persist, particularly regarding the inference performance in complex scenarios involving multi-table joins and nested queries. Current methodologies primarily utilize supervised fine-tuning (SFT) to train the NL2SQL model, which may limit adaptability and interpretability in new environments (e.g., finance and healthcare). To enhance the reasoning performance of the NL2SQL model in these complex situations, we introduce SQL-R1, a novel NL2SQL reasoning model trained with reinforcement learning (RL) algorithms. We design a specialized RL-based reward function tailored for NL2SQL tasks and discuss the impact of cold start on the effectiveness of intensive training. In addition, we achieve competitive accuracy using only a tiny amount of synthetic NL2SQL data for augmented training and further explore data engineering for RL. In our experiments, SQL-R1 achieves execution accuracy of 88.6% and 66.6% on the Spider and BIRD benchmarks, respectively, using only a 7B base model.
中文: SQL-R1是一种基于强化学习训练的新型NL2SQL模型,旨在提升多表连接和嵌套查询等复杂场景下的推理性能,仅用少量合成数据即在基准测试中取得了优异准确率。
English: SQL-R1 is a novel NL2SQL model trained with reinforcement learning to improve reasoning performance in complex scenarios like multi-table joins and nested queries, achieving competitive accuracy on benchmarks with minimal synthetic data.
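The execution accuracy metric cited in the abstract above can be illustrated with a minimal sketch: run the predicted and gold SQL against the same database and count a hit when the result sets agree. The toy schema, the queries, and the order-insensitive comparison below are all assumptions made for illustration; the official Spider and BIRD evaluation scripts may differ in detail.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match, ignoring row order."""
    correct = 0
    for predicted, gold in pairs:
        gold_rows = run_query(conn, gold)
        pred_rows = run_query(conn, predicted)
        if gold_rows is not None and pred_rows == gold_rows:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Hypothetical toy database, not taken from any benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    # Valid SQL, wrong result: counted as incorrect.
    ("SELECT name FROM singer",
     "SELECT name FROM singer WHERE age < 30"),
    # Invalid SQL fails to execute: counted as incorrect.
    ("SELEC name FROM singer",
     "SELECT name FROM singer"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.3f}")
```

Comparing executed results rather than SQL strings is what lets semantically equivalent queries count as correct, which is also why such a metric can serve as a reward signal in RL-style training.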

Authors:Ryan Y. Lin, Julius Berner, Valentin Duruisseaux, David Pitt, Daniel Leibovici, Jean Kossaifi, Kamyar Azizzadenesheli, Anima Anandkumar
Title: Enabling Automatic Differentiation with Mollified Graph Neural Operators
Abstract:
Physics-informed neural operators offer a powerful framework for learning solution operators of partial differential equations (PDEs) by combining data and physics losses. However, these physics losses rely on derivatives. Computing these derivatives remains challenging, with spectral and finite difference methods introducing approximation errors due to finite resolution. Here, we propose the mollified graph neural operator (mGNO), the first method to leverage automatic differentiation and compute \emph{exact} gradients on arbitrary geometries. This enhancement enables efficient training on irregular grids and varying geometries while allowing seamless evaluation of physics losses at randomly sampled points for improved generalization. For a PDE example on regular grids, mGNO paired with autograd reduced the L2 relative data error by 20x compared to finite differences, although training was slower. It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough. On these unstructured point clouds, mGNO leads to errors that are consistently 2 orders of magnitude lower than machine learning baselines (Meta-PDE) for comparable runtimes, and also delivers speedups from 1 to 3 orders of magnitude compared to the numerical solver for similar accuracy. mGNOs can also be used to solve inverse design and shape optimization problems on complex geometries.
中文: 平滑图神经算子(mGNO)利用自动微分计算精确梯度,支持在任意几何结构上高效训练,相比现有方法实现了精度和速度的显著提升。
English: The mollified graph neural operator (mGNO) introduces automatic differentiation to compute exact gradients for physics-informed learning, enabling efficient training on irregular geometries and achieving significantly higher accuracy and speed than existing methods.

Authors:Saeid Jamshidi, Amin Nikanjam, Nafi Kawser Wazed, Foutse Khomh
Title: Leveraging Machine Learning Techniques in Intrusion Detection Systems for Internet of Things
Abstract:
As the Internet of Things (IoT) continues to expand, ensuring the security of connected devices has become increasingly critical. Traditional Intrusion Detection Systems (IDS) often fall short in managing the dynamic and large-scale nature of IoT networks. This paper explores how Machine Learning (ML) and Deep Learning (DL) techniques can significantly enhance IDS performance in IoT environments. We provide a thorough overview of various IDS deployment strategies and categorize the types of intrusions common in IoT systems. A range of ML methods -- including Support Vector Machines, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Random Forests -- are examined alongside advanced DL models such as LSTM, CNN, Autoencoders, RNNs, and Deep Belief Networks. Each technique is evaluated based on its accuracy, efficiency, and suitability for real-world IoT applications. We also address major challenges such as high false positive rates, data imbalance, encrypted traffic analysis, and the resource constraints of IoT devices. In addition, we highlight the emerging role of Generative AI and Large Language Models (LLMs) in improving threat detection, automating responses, and generating intelligent security policies. Finally, we discuss ethical and privacy concerns, underscoring the need for responsible and transparent implementation. This paper aims to provide a comprehensive framework for developing adaptive, intelligent, and secure IDS solutions tailored for the evolving landscape of IoT.
中文摘要:本文探讨了机器学习和深度学习技术如何通过评估多种算法并应对数据不平衡和设备限制等关键挑战,来提升物联网网络入侵检测系统的性能,同时兼顾伦理考量。
English Summary: This paper investigates how Machine Learning and Deep Learning techniques can improve Intrusion Detection Systems for IoT networks by evaluating various algorithms and addressing key challenges like data imbalance and device constraints, while also considering ethical implications.

Authors:Alhad Daftardar, Jianqiao Mo, Joey Ah-kiow, Benedikt Bünz, Ramesh Karri, Siddharth Garg, Brandon Reagen
Title: Need for zkSpeed: Accelerating HyperPlonk for Zero-Knowledge Proofs
Abstract:
Zero-Knowledge Proofs (ZKPs) are rapidly gaining importance in privacy-preserving and verifiable computing. ZKPs enable a proving party to prove the truth of a statement to a verifying party without revealing anything else. ZKPs have applications in blockchain technologies, verifiable machine learning, and electronic voting, but have yet to see widespread adoption due to the computational complexity of the proving process. Recent works have accelerated the key primitives of state-of-the-art ZKP protocols on GPU and ASIC. However, the protocols accelerated thus far face one of two challenges: they either require a trusted setup for each application, or they generate larger proof sizes with higher verification costs, limiting their applicability in scenarios with numerous verifiers or strict verification time constraints. This work presents an accelerator, zkSpeed, for HyperPlonk, a state-of-the-art ZKP protocol that supports both one-time, universal setup and small proof sizes for typical ZKP applications in publicly verifiable, consensus-based systems. We accelerate the entire protocol, including two major primitives: SumCheck and Multi-scalar Multiplications (MSMs). We develop a full-chip architecture using 366.46 mm$^2$ and 2 TB/s of bandwidth to accelerate the entire proof generation process, achieving geometric mean speedups of 801$\times$ over CPU baselines.
中文: 零知识证明在隐私保护和可验证计算中至关重要,本文提出的zkSpeed加速器针对HyperPlonk协议,实现了大幅性能提升,同时支持通用设置和小型证明尺寸。
English: Zero-Knowledge Proofs (ZKPs) are critical for secure and private computations, and this work introduces zkSpeed, an accelerator for the HyperPlonk protocol that achieves significant speed improvements while supporting universal setup and compact proofs.

Authors:Vahid Majdinasab, Amin Nikanjam, Foutse Khomh
Title: Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search
Abstract:
The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state-of-the-art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism's effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.
中文: Prism是一个动态基准测试框架,通过树状状态建模和多智能体评估系统全面评测大语言模型,能够突破静态基准的局限,随模型发展自适应调整,并提供详尽的性能诊断分析。
English: Prism is a dynamic benchmarking framework that uses tree-based state modeling and multi-agent evaluation to comprehensively assess large language models, overcoming the limitations of static benchmarks by adapting to model advancements and providing detailed performance diagnostics.

Authors:Bingyang Wang, Kaer Huang, Bin Li, Yiqiang Yan, Lihe Zhang, Huchuan Lu, You He
Title: EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively
Abstract:
Open-World Tracking (OWT) aims to track every object of any category, which requires the model to have strong generalization capabilities. Trackers can improve their generalization ability by leveraging Visual Language Models (VLMs). However, challenges arise with the fine-tuning strategies when VLMs are transferred to OWT: full fine-tuning results in excessive parameter and memory costs, while the zero-shot strategy leads to sub-optimal performance. To solve the problem, EffOWT is proposed for efficiently transferring VLMs to OWT. Specifically, we build a small and independent learnable side network outside the VLM backbone. By freezing the backbone and only executing backpropagation on the side network, the model's efficiency requirements can be met. In addition, EffOWT enhances the side network by proposing a hybrid structure of Transformer and CNN to improve the model's performance in the OWT field. Finally, we implement sparse interactions on the MLP, thus reducing parameter updates and memory costs significantly. Thanks to the proposed methods, EffOWT achieves an absolute gain of 5.5% on the tracking metric OWTA for unknown categories, while only updating 1.3% of the parameters compared to full fine-tuning, with a 36.4% memory saving. Other metrics also demonstrate obvious improvement.
中文摘要:EffOWT通过构建独立的混合Transformer-CNN侧网络,在冻结主干网络的同时实现视觉语言模型向开放世界跟踪的高效迁移,仅更新1.3%参数就显著提升了未知类别跟踪性能。
English Summary: EffOWT efficiently transfers Visual Language Models to Open-World Tracking by using a small side network with a hybrid Transformer-CNN structure, achieving significant performance gains with minimal parameter updates and memory savings.

Authors:Martin Weyssow, Chengran Yang, Junkai Chen, Ratnadira Widyasari, Ting Zhang, Huihui Huang, Huu Hung Nguyen, Yan Naing Tun, Tan Bui, Yikun Li, Ang Han Wei, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo
Title: R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation
Abstract:
Large language models (LLMs) have shown promising performance in software vulnerability detection, yet their reasoning capabilities remain unreliable. We propose R2Vul, a method that combines reinforcement learning from AI feedback (RLAIF) and structured reasoning distillation to teach small code LLMs to detect vulnerabilities while generating security-aware explanations. Unlike prior chain-of-thought and instruction tuning approaches, R2Vul rewards well-founded over deceptively plausible vulnerability explanations through RLAIF, which results in more precise detection and high-quality reasoning generation. To support RLAIF, we construct the first multilingual preference dataset for vulnerability detection, comprising 18,000 high-quality samples in C#, JavaScript, Java, Python, and C. We evaluate R2Vul across five programming languages and against four static analysis tools, eight state-of-the-art LLM-based baselines, and various fine-tuning approaches. Our results demonstrate that a 1.5B R2Vul model exceeds the performance of its 32B teacher model and leading commercial LLMs such as Claude-4-Opus. Furthermore, we introduce a lightweight calibration step that reduces false positive rates under varying imbalanced data distributions. Finally, through qualitative analysis, we show that both LLM and human evaluators consistently rank the R2Vul model's reasoning higher than that of other reasoning-based baselines.
中文: R2Vul通过强化学习和结构化推理增强小型代码大语言模型的漏洞检测能力,在多语言验证中超越大型模型并降低误报率。
English: R2Vul enhances small code LLMs' vulnerability detection by integrating reinforcement learning and structured reasoning, outperforming larger models and reducing false positives with multilingual validation.

Authors:Hongchao Fang, Yixin Liu, Jiangshu Du, Can Qin, Ran Xu, Feng Liu, Lichao Sun, Dongwon Lee, Lifu Huang, Wenpeng Yin
Title: Could AI Trace and Explain the Origins of AI-Generated Images and Text?
Abstract:
AI-generated content is becoming increasingly prevalent in the real world, leading to serious ethical and societal concerns. For instance, adversaries might exploit large multimodal models (LMMs) to create images that violate ethical or legal standards, while paper reviewers may misuse large language models (LLMs) to generate reviews without genuine intellectual effort. While prior work has explored detecting AI-generated images and texts, and occasionally tracing their source models, there is a lack of a systematic and fine-grained comparative study. Important dimensions--such as AI-generated images vs. text, fully vs. partially AI-generated images, and general vs. malicious use cases--remain underexplored. Furthermore, whether AI systems like GPT-4o can explain why certain forged content is attributed to specific generative models is still an open question, with no existing benchmark addressing this. To fill this gap, we introduce AI-FAKER, a comprehensive multimodal dataset with over 280,000 samples spanning multiple LLMs and LMMs, covering both general and malicious use cases for AI-generated images and texts. Our experiments reveal two key findings: (i) AI authorship detection depends not only on the generated output but also on the model's original training intent; and (ii) GPT-4o provides highly consistent but less specific explanations when analyzing content produced by OpenAI's own models, such as DALL-E and GPT-4o itself.
中文摘要:该摘要介绍了AI-FAKER这一大型多模态数据集,旨在填补AI生成内容检测与溯源的研究空白,实验发现检测效果既取决于生成输出也受模型训练意图影响,且GPT-4o对自身模型生成内容能提供一致但不够具体的解释。
English Summary: The abstract introduces AI-FAKER, a large multimodal dataset addressing gaps in detecting and attributing AI-generated content, revealing that detection depends on both output and training intent, and GPT-4o offers consistent but less specific explanations for its own models' content.

Authors:Yao Xiao, Tingfa Xu, Yu Xin, Jianan Li
Title: FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection
Abstract:
Embedded flight devices with visual capabilities have become essential for a wide range of applications. In aerial image detection, while many existing methods have partially addressed the issue of small target detection, challenges remain in optimizing small target detection and balancing detection accuracy with efficiency. These issues are key obstacles to the advancement of real-time aerial image detection. In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO, to address the imbalance between detection accuracy and efficiency. Our method comprises two lightweight modules: Feature Complementary Mapping Module (FCM) and Multi-Kernel Perception Unit (MKP), designed to enhance object perception for small targets in aerial images. FCM focuses on alleviating the problem of information imbalance caused by the loss of small target information in deep networks. It aims to integrate spatial positional information of targets more deeply into the network, better aligning with semantic information in the deeper layers to improve the localization of small targets. We introduce MKP, which leverages convolutions with kernels of different sizes to enhance the relationships between targets of various scales and improve the perception of targets at different scales. Extensive experimental results on three major aerial image datasets, including VisDrone, UAVDT, and AI-TOD, demonstrate that FBRT-YOLO outperforms various real-time detectors in terms of performance and speed.
中文: 本文提出FBRT-YOLO实时航拍图像检测器,通过特征互补映射模块和多核感知单元增强小目标的信息整合与多尺度感知能力,在多个数据集上实现了性能与速度的双重提升。
English: This paper introduces FBRT-YOLO, a real-time aerial image detector featuring two lightweight modules—FCM and MKP—that enhance small target detection by improving information integration and multi-scale perception, achieving superior performance and speed on benchmark datasets.

Authors:Qijun Jiang, Xiaodan Shao, Rui Zhang
Title: Statistical Channel Based Low-Complexity Rotation and Position Optimization for 6D Movable Antennas Enabled Wireless Communication
Abstract:
Six-dimensional movable antenna (6DMA) is a promising technology to fully exploit spatial variation in wireless channels by allowing flexible adjustment of three-dimensional (3D) positions and rotations of antennas at the transceiver. In this paper, we investigate the practical low-complexity design of 6DMA-enabled communication systems, including transmission protocol, statistical channel information (SCI) acquisition, and joint position and rotation optimization of 6DMA surfaces based on the SCI of users. Specifically, an orthogonal matching pursuit (OMP)-based algorithm is proposed for the estimation of SCI of users at all possible position-rotation pairs of 6DMA surfaces based on the channel measurements at a small subset of position-rotation pairs. Then, the average sum logarithmic rate of all users is maximized by jointly designing the positions and rotations of 6DMA surfaces based on their SCI acquired. Different from prior works on 6DMA which adopt alternating optimization to design 6DMA positions/rotations with iterations, we propose a new sequential optimization approach that first determines 6DMA rotations and then finds their feasible positions to realize the optimized rotations subject to practical antenna placement constraints. Simulation results show that the proposed sequential optimization significantly reduces the computational complexity of conventional alternating optimization, while achieving comparable communication performance. It is also shown that the proposed SCI-based 6DMA design can effectively enhance the communication throughput of wireless networks over existing fixed (position and rotation) antenna arrays, yet with a practically appealing low-complexity implementation.
中文摘要:本文针对六维可移动天线系统提出了一种低复杂度的顺序优化方法,先确定天线旋转角度再寻找可行位置,相比传统交替优化方法大幅降低了计算复杂度,同时保持了相当的通信性能,并能有效提升无线网络吞吐量。
English Summary: This paper presents a low-complexity sequential optimization approach for 6DMA systems that first determines antenna rotations before finding feasible positions, achieving comparable performance to conventional methods with significantly reduced computational complexity while enhancing network throughput over fixed antenna arrays.

Authors:Ali Alfageeh, Sadegh AlMahdi Kazemi Zarkouei, Daye Nam, Daniel Prol, Matin Amoozadeh, Souti Chattopadhyay, James Prather, Paul Denny, Juho Leinonen, Michael Hilton, Sruti Srinivasa Ragavan, Mohammad Amin Alipour
Title: From Prompts to Propositions: A Logic-Based Lens on Student-LLM Interactions
Abstract:
Background and Context. The increasing integration of large language models (LLMs) in computing education presents an emerging challenge in understanding how students use LLMs and craft prompts to solve computational tasks. Prior research has used both qualitative and quantitative methods to analyze prompting behavior, but these approaches lack scalability or fail to effectively capture the semantic evolution of prompts. Objective. In this paper, we investigate whether students' prompts can be systematically analyzed using propositional logic constraints. We examine whether this approach can identify patterns in prompt evolution, detect struggling students, and provide insights into effective and ineffective strategies. Method. We introduce Prompt2Constraints, a novel method that translates students' prompts into logical constraints. The constraints are able to represent the intent of the prompts in succinct and quantifiable ways. We used this approach to analyze a dataset of 1,872 prompts from 203 students solving introductory programming tasks. Findings. We find that while successful and unsuccessful attempts tend to use a similar number of constraints overall, when students fail, they often modify their prompts more significantly, shifting problem-solving strategies midway. We also identify points where specific interventions could be most helpful to students for refining their prompts. Implications. This work offers a new and scalable way to detect students who struggle in solving natural language programming tasks. This work could be extended to investigate more complex tasks and integrated into programming tools to provide real-time support.
中文: 本研究提出Prompt2Constraints方法,通过将学生提示转换为逻辑约束来分析提示演化模式并识别学习困难者,为检测自然语言编程任务中的困难提供了可扩展的新途径。
English: This study introduces Prompt2Constraints, a method that translates student prompts into logical constraints to analyze patterns in prompt evolution and identify struggling students, offering a scalable approach for detecting difficulties in natural language programming tasks.
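
The abstract's observation that failing students "modify their prompts more significantly" can be made concrete with a small sketch. It assumes, purely for illustration, that each prompt has already been translated into a set of constraint labels; the label names, the `constraint_change` helper, and the use of Jaccard distance are hypothetical choices and not the metric reported by the authors.

```python
def constraint_change(prev, curr):
    """Jaccard distance between two constraint sets (0.0 = identical, 1.0 = disjoint)."""
    prev, curr = set(prev), set(curr)
    union = prev | curr
    if not union:
        return 0.0  # two empty prompts count as unchanged
    return 1.0 - len(prev & curr) / len(union)


def change_profile(attempts):
    """Change between each consecutive pair of attempts in one student's session."""
    return [constraint_change(a, b) for a, b in zip(attempts, attempts[1:])]


# Hypothetical session: a small refinement, then a full strategy switch.
session = [
    {"returns_list", "sorted"},
    {"returns_list", "sorted", "handles_empty"},
    {"uses_recursion"},
]
profile = change_profile(session)
print(profile)
```

A large jump late in the profile would correspond to the mid-session strategy shift the abstract describes, which is where an intervention might be targeted.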

Authors:Jiahao Huang, Fanwen Wang, Pedro F. Ferreira, Haosen Zhang, Yinzhe Wu, Zhifan Gao, Lei Zhu, Angelica I. Aviles-Rivero, Carola-Bibiane Schonlieb, Andrew D. Scott, Zohya Khalique, Maria Dwornik, Ramyah Rajakulasingam, Ranil De Silva, Dudley J. Pennell, Guang Yang, Sonia Nielles-Vallespin
Title: RSFR: A Coarse-to-Fine Reconstruction Framework for Diffusion Tensor Cardiac MRI with Semantic-Aware Refinement
Abstract:
Cardiac diffusion tensor imaging (DTI) offers unique insights into cardiomyocyte arrangements, bridging the gap between microscopic and macroscopic cardiac function. However, its clinical utility is limited by technical challenges, including a low signal-to-noise ratio, aliasing artefacts, and the need for accurate quantitative fidelity. To address these limitations, we introduce RSFR (Reconstruction, Segmentation, Fusion & Refinement), a novel framework for cardiac diffusion-weighted image reconstruction. RSFR employs a coarse-to-fine strategy, leveraging zero-shot semantic priors via the Segment Anything Model and a robust Vision Mamba-based reconstruction backbone. Our framework integrates semantic features effectively to mitigate artefacts and enhance fidelity, achieving state-of-the-art reconstruction quality and accurate DT parameter estimation under high undersampling rates. Extensive experiments and ablation studies demonstrate the superior performance of RSFR compared to existing methods, highlighting its robustness, scalability, and potential for clinical translation in quantitative cardiac DTI.
中文: RSFR框架通过结合语义分割与基于Vision Mamba的骨干网络,显著提升了心脏扩散张量成像的重建质量,并在高欠采样率下实现了精准的参数估计。
English: The RSFR framework enhances cardiac diffusion tensor imaging by combining semantic segmentation with a Vision Mamba-based backbone, achieving superior reconstruction quality and accurate parameter estimation under high undersampling rates.

Authors:Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
Title: Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
Abstract:
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
中文摘要:提出的DICE-Talk框架通过跨模态注意力机制和可学习情感库实现身份与情感的分离建模,在保持口型同步和身份特征的同时,显著提升了情感表达的准确性和自然度。
English Summary: The proposed DICE-Talk framework addresses emotional talking head generation by disentangling identity from emotion through cross-modal attention and learnable emotion banks, achieving superior emotion accuracy while maintaining lip synchronization and identity preservation.

Authors:Guoquan Wang, Qiang Luo, Weisong Hu, Pengfei Yao, Wencong Zeng, Guorui Zhou, Kun Gai
Title: FIM: Frequency-Aware Multi-View Interest Modeling for Local-Life Service Recommendation
Abstract:
People's daily lives involve numerous periodic behaviors, such as eating and traveling. Local-life platforms cater to these recurring needs by providing essential services tied to daily routines. Therefore, users' periodic intentions are reflected in their interactions with the platforms. There are two main challenges in modeling users' periodic behaviors in local-life service recommendation systems: 1) the diverse demands of users exhibit varying periodicities, which are difficult to distinguish as they are mixed in the behavior sequences; 2) the periodic behaviors of users are subject to dynamic changes due to factors such as holidays and promotional events. Existing methods struggle to distinguish the periodicities of diverse demands and overlook the importance of dynamically capturing changes in users' periodic behaviors. To this end, we employ a Frequency-Aware Multi-View Interest Modeling framework (FIM). Specifically, we propose a multi-view search strategy that decomposes users' demands from different perspectives to separate their various periodic intentions. This allows the model to extract their periodic features more comprehensively than category-searched-only methods. Moreover, we propose a frequency-domain perception and evolution module. This module uses the Fourier Transform to convert users' temporal behaviors into the frequency domain, enabling the model to dynamically perceive their periodic features. Extensive offline experiments demonstrate that FIM achieves significant improvements on public and industrial datasets, showing its capability to effectively model users' periodic intentions. Furthermore, the model has been deployed on the Kuaishou local-life service platform. Through online A/B experiments, the transaction volume has been significantly improved.
中文摘要:FIM框架通过多视角需求分解和频域感知技术,有效解决用户周期性行为建模中的多样性区分和动态变化问题,在本地生活服务平台中显著提升了交易量。
English Summary: The FIM framework addresses challenges in modeling users' periodic behaviors by employing multi-view demand decomposition and frequency-domain perception, achieving significant performance improvements in local-life service recommendations.
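
As a rough illustration of how a Fourier transform exposes periodicity in a behavior sequence, the sketch below computes a plain discrete Fourier transform over a toy series and reports the period of the strongest non-constant frequency. The data and the `dominant_period` helper are hypothetical; this is not FIM's perception and evolution module, only the underlying idea.

```python
import cmath


def dominant_period(series):
    """Return the period (in samples) of the strongest non-DC frequency, or None."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant (DC) component
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k == 0:
        return None  # constant series: no periodic component
    return n / best_k


# Hypothetical daily order counts over four weeks with a weekend peak.
orders = [1, 1, 1, 1, 1, 3, 3] * 4
print(dominant_period(orders))  # 7.0, i.e. a weekly cycle
```

A production system would typically use a fast Fourier transform library rather than this quadratic-time loop.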

Authors:Qiyao Wang, Guhong Chen, Hongbo Wang, Huaren Liu, Minghui Zhu, Zhifei Qin, Linwei Li, Yilin Yue, Shiqiang Wang, Jiayan Li, Yihang Wu, Ziqiang Liu, Longze Chen, Run Luo, Liyang Fan, Jiaming Li, Lei Zhang, Kan Xu, Hongfei Lin, Hamid Alinejad-Rokny, Shiwen Ni, Yuan Lin, Min Yang
Title: IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
Abstract:
Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench, covering 8 IP mechanisms and 20 tasks. This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation. We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy, revealing substantial room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. We publicly release all data and code of IPBench and will continue to update it with additional IP-related tasks to better reflect real-world challenges in the intellectual property domain.
Chinese: IPBench作为首个全面的双语基准被提出,用于评估大语言模型在现实知识产权场景中的表现,结果显示即使是表现最佳的DeepSeek-V3模型也仅有75.8%的准确率,存在明显改进空间。
English: IPBench is introduced as the first comprehensive bilingual benchmark to evaluate LLMs in real-world intellectual property scenarios, revealing that even top models like DeepSeek-V3 have significant room for improvement with only 75.8% accuracy.

Authors:Qiyao Wang, Guhong Chen, Hongbo Wang, Huaren Liu, Minghui Zhu, Zhifei Qin, Linwei Li, Yilin Yue, Shiqiang Wang, Jiayan Li, Yihang Wu, Ziqiang Liu, Longze Chen, Run Luo, Liyang Fan, Jiaming Li, Lei Zhang, Kan Xu, Chengming Li, Hamid Alinejad-Rokny, Shiwen Ni, Yuan Lin, Min Yang
Title: IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
Abstract:
Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP-related tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce IPBench, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing 8 IP mechanisms and 20 distinct tasks, designed to evaluate LLMs in real-world IP scenarios. We benchmark 17 main LLMs, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench, and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary URLs.
Chinese: IPBench作为首个全面的双语基准被提出,用于评估大语言模型在现实知识产权场景中的表现,结果显示即使是表现最佳的DeepSeek-V3模型也仅有75.8%的准确率,存在明显改进空间。
English: IPBench is introduced as the first comprehensive bilingual benchmark to evaluate LLMs in real-world intellectual property scenarios, revealing that even top models like DeepSeek-V3 have significant room for improvement with only 75.8% accuracy.

Authors:Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, Yunhe Wang
Title: DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution
Abstract:
Recent advances in diffusion models have improved Real-World Image Super-Resolution (Real-ISR), but existing methods lack human feedback integration, risking misalignment with human preferences and potentially leading to artifacts, hallucinations and harmful content generation. To this end, we are the first to introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks to effectively enhance the alignment of generated outputs with human preferences. Specifically, we introduce Direct Preference Optimization (DPO) into Real-ISR to achieve alignment, where DPO serves as a general alignment technique that directly learns from the human preference dataset. Nevertheless, unlike high-level tasks, the pixel-level reconstruction objectives of Real-ISR are difficult to reconcile with the image-level preferences of DPO, which can make DPO overly sensitive to local anomalies and reduce generation quality. To resolve this dichotomy, we propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance through two strategies: (a) semantic instance alignment strategy, implementing instance-level alignment to ensure fine-grained perceptual consistency, and (b) user description feedback strategy, mitigating hallucinations through semantic textual feedback on instance-level images. As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.
中文摘要:针对真实世界图像超分辨率中扩散模型缺乏人类反馈集成的问题,本研究首次提出直接语义偏好优化方法,通过语义实例对齐和用户描述反馈策略实现像素级重建与图像级偏好的协同,有效提升生成质量与人类偏好的一致性。
English Summary: Recent diffusion models for Real-World Image Super-Resolution (Real-ISR) lack human feedback integration, leading to potential artifacts and misalignment with human preferences, which this work addresses by introducing Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences through semantic guidance.
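
For readers unfamiliar with the DPO objective that DSPO builds on, here is a minimal sketch of the standard pairwise DPO loss for a single preference pair. It assumes scalar log-likelihoods for the preferred and rejected outputs under the trained policy and a frozen reference model; the function name and the default `beta` are illustrative, and the diffusion-specific and instance-level parts of DSPO are not modeled here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors the chosen output than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy has shifted probability toward the preferred output: the loss drops.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends only on the difference of log-ratios, so it falls as the policy raises the preferred output's likelihood relative to the rejected one.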

Authors:Shangde Gao, Ke Liu, Yichao Fu, Hongxia Xu, Jian Wu
Title: Matrix Factorization with Dynamic Multi-view Clustering for Recommender System
Abstract:
Matrix factorization (MF), a cornerstone of recommender systems, decomposes user-item interaction matrices into latent representations. Traditional MF approaches, however, employ a two-stage, non-end-to-end paradigm, sequentially performing recommendation and clustering, resulting in prohibitive computational costs for large-scale applications like e-commerce and IoT, where billions of users interact with trillions of items. To address this, we propose Matrix Factorization with Dynamic Multi-view Clustering (MFDMC), a unified framework that balances efficient end-to-end training with comprehensive utilization of web-scale data and enhances interpretability. MFDMC leverages dynamic multi-view clustering to learn user and item representations, adaptively pruning poorly formed clusters. Each entity's representation is modeled as a weighted projection of robust clusters, capturing its diverse roles across views. This design maximizes representation space utilization, improves interpretability, and ensures resilience for downstream tasks. Extensive experiments demonstrate MFDMC's superior performance in recommender systems and other representation learning domains, such as computer vision, highlighting its scalability and versatility.
中文: 矩阵分解与动态多视图聚类(MFDMC)作为一种高效的端到端框架,通过自适应学习用户和项目的鲁棒聚类投影,提升了可扩展性和可解释性,在推荐系统及其他领域超越了传统方法的性能。
English: Matrix factorization with dynamic multi-view clustering (MFDMC) is introduced as an efficient end-to-end framework that enhances scalability and interpretability by adaptively learning user and item representations through robust cluster projections, outperforming traditional methods in recommender systems and other domains.
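
The baseline idea of decomposing a user-item interaction matrix into latent representations can be sketched with plain stochastic gradient descent. This is classic matrix factorization only, under illustrative hyperparameters and toy ratings; it does not include MFDMC's dynamic multi-view clustering, and every name below is hypothetical.

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01,
              epochs=2000, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


# Toy (user, item, rating) triples; unobserved pairs are simply absent.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = factorize(ratings, n_users=3, n_items=3)
for u, i, r in ratings:
    print(u, i, r, round(predict(P, Q, u, i), 2))
```

After training, `predict` can also be called on unobserved pairs such as `(0, 2)` to obtain a recommendation score.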

Authors:Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi
Title: ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data
Abstract:
Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).
中文: ParaPO是一种微调方法,能有效减少语言模型对训练数据的无意识逐字复现,同时保持模型实用性,其通过系统提示的变体可保留对名言的恰当引用能力。
English: ParaPO is a fine-tuning method that effectively reduces language models' unintentional verbatim regurgitation from training data while maintaining their utility, with a variant using system prompts to preserve appropriate quotation recall.

Authors:Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Title: ToolRL: Reward is All Tool Learning Needs
Abstract:
Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.
中文摘要:本研究提出了一种基于强化学习的奖励设计原则,通过系统分析奖励策略并采用GRPO训练方法,显著提升了大型语言模型的工具使用能力,在多项基准测试中优于基础模型和SFT模型。
English Summary: This study introduces a principled reward design for reinforcement learning to enhance LLMs' tool use, achieving significant performance gains over base and SFT models through systematic reward strategy analysis and GRPO training.
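
Two ideas from the abstract can be sketched together: a reward that gives partial credit for a tool call rather than all-or-nothing answer matching, and the group-relative advantage normalization used by GRPO. The reward weights, the dictionary layout of a tool call, and both function names are assumptions made for illustration; they are not the reward design proposed in the paper.

```python
import statistics


def tool_call_reward(pred, gold):
    """Hypothetical partial-credit reward: half for the tool name, half for arguments."""
    name_score = 1.0 if pred.get("name") == gold["name"] else 0.0
    gold_args = gold.get("args", {})
    if gold_args:
        matched = sum(
            1 for key, value in gold_args.items()
            if pred.get("args", {}).get(key) == value
        )
        arg_score = matched / len(gold_args)
    else:
        arg_score = 1.0
    return 0.5 * name_score + 0.5 * arg_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


gold = {"name": "search", "args": {"query": "weather", "city": "Paris"}}
samples = [
    {"name": "search", "args": {"query": "weather", "city": "Paris"}},
    {"name": "search", "args": {"query": "weather", "city": "Rome"}},
    {"name": "lookup", "args": {}},
]
rewards = [tool_call_reward(s, gold) for s in samples]
print(rewards)  # [1.0, 0.75, 0.0]
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, a partially correct call can still receive a positive learning signal when the other samples in its group are worse.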

Authors:Jie Zou, Cheng Lin, Weikang Guo, Zheng Wang, Jiwei Wei, Yang Yang, Heng Tao Shen
Title: Multi-Type Context-Aware Conversational Recommender Systems via Mixture-of-Experts
Abstract:
Conversational recommender systems enable natural language conversations and thus lead to a more engaging and effective recommendation scenario. As the conversations for recommender systems usually contain limited contextual information, many existing conversational recommender systems incorporate external sources to enrich the contextual information. However, how to combine different types of contextual information is still a challenge. In this paper, we propose a multi-type context-aware conversational recommender system, called MCCRS, effectively fusing multi-type contextual information via mixture-of-experts to improve conversational recommender systems. MCCRS incorporates both structured information and unstructured information, including the structured knowledge graph, unstructured conversation history, and unstructured item reviews. It consists of several experts, with each expert specialized in a particular domain (i.e., one specific type of contextual information). Multiple experts are then coordinated by a ChairBot to generate the final results. MCCRS takes advantage of different types of contextual information, and the specialization of different experts, followed by coordination through a ChairBot, removes the model's bottleneck on any single type of contextual information. Experimental results demonstrate that our proposed MCCRS method achieves significantly higher performance compared to existing baselines.
Chinese: 本文提出MCCRS,一种多类型上下文感知对话推荐系统,通过专家混合方法有效融合多种上下文信息,显著优于现有基线方法。
English: This paper introduces MCCRS, a multi-type context-aware conversational recommender system that effectively integrates various contextual information through a mixture-of-experts approach, significantly outperforming existing baselines.
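
The general mixture-of-experts pattern described in the abstract, where several specialized experts score items and a coordinator combines them, can be sketched as a softmax-weighted sum. The expert outputs, gate logits, and function names below are hypothetical, and the abstract does not specify how the ChairBot actually weights its experts, so this shows only the generic gating idea.

```python
import math


def softmax(logits):
    """Convert raw gate logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_scores, gate_logits):
    """Weighted sum of per-expert item scores using softmax gate weights."""
    weights = softmax(gate_logits)
    n_items = len(expert_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, expert_scores))
        for i in range(n_items)
    ]


# Hypothetical item scores from three experts: knowledge graph,
# conversation history, and item reviews.
expert_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
uniform = combine_experts(expert_scores, gate_logits=[0.0, 0.0, 0.0])
review_heavy = combine_experts(expert_scores, gate_logits=[0.0, 0.0, 3.0])
print(uniform)
print(review_heavy)
```

With equal gate logits the result is a plain average of the experts; raising one logit shifts the final ranking toward that expert's preferences.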

Authors:Tao He, Lizi Liao, Ming Liu, Bing Qin
Title: Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning
Abstract:
Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real-world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task-specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user-tailored dialogue policy planning. Building on this foundation, we present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non-collaborative settings, demonstrate the effectiveness of UDP in learning user-specific dialogue strategies. Results validate the protocol's utility and highlight UDP's robustness, adaptability, and potential to advance user-centric dialogue systems.
Chinese Summary: 本研究提出用户定制对话策略规划(UDP)框架,通过建立用户特征模型与反馈机制解决现有对话系统的局限性,综合实验证明其在多样化交互场景中具有卓越的适应能力。
English Summary: The study introduces the User-Tailored Dialogue Policy Planning (UDP) framework to address limitations in existing dialogue systems by modeling user traits and feedback, demonstrating superior adaptability in diverse interaction scenarios through comprehensive experiments.

Authors:Ying Wang, Tingfa Xu, Jianan Li
Title: FocusTrack: A Self-Adaptive Local Sampling Algorithm for Efficient Anti-UAV Tracking
Abstract:
Anti-UAV tracking poses significant challenges, including small target sizes, abrupt camera motion, and cluttered infrared backgrounds. Existing tracking paradigms can be broadly categorized into global- and local-based methods. Global-based trackers, such as SiamDT, achieve high accuracy by scanning the entire field of view but suffer from excessive computational overhead, limiting real-world deployment. In contrast, local-based methods, including OSTrack and ROMTrack, efficiently restrict the search region but struggle when targets undergo significant displacements due to abrupt camera motion. Through preliminary experiments, it is evident that a local tracker, when paired with adaptive search region adjustment, can significantly enhance tracking accuracy, narrowing the gap between local and global trackers. To address this challenge, we propose FocusTrack, a novel framework that dynamically refines the search region and strengthens feature representations, achieving an optimal balance between computational efficiency and tracking accuracy. Specifically, our Search Region Adjustment (SRA) strategy estimates the target presence probability and adaptively adjusts the field of view, ensuring the target remains within focus. Furthermore, to counteract feature degradation caused by varying search regions, the Attention-to-Mask (ATM) module is proposed. This module integrates hierarchical information, enriching the target representations with fine-grained details. Experimental results demonstrate that FocusTrack achieves state-of-the-art performance, obtaining 67.7% AUC on AntiUAV and 62.8% AUC on AntiUAV410, outperforming the baseline tracker by 8.5% and 9.1% AUC, respectively. In terms of efficiency, FocusTrack surpasses global-based trackers, requiring only 30G MACs and achieving 143 fps with FocusTrack (SRA) and 44 fps with the full version, both enabling real-time tracking.
中文: 反无人机跟踪面临小目标和快速移动等挑战,FocusTrack通过动态调整搜索区域和增强特征表示,在效率和精度间取得平衡,实现了顶尖性能。
English: Anti-UAV tracking faces challenges like small targets and abrupt motion, addressed by FocusTrack, which dynamically adjusts search regions and enhances features to balance efficiency and accuracy, achieving state-of-the-art results.
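
Illustrative note: the abstract gives no detail on how the Search Region Adjustment (SRA) strategy maps the target presence probability to a field of view. The following is only a minimal sketch of the general idea, not the authors' method. The function name, thresholds, step size, and search-factor bounds are all hypothetical values chosen for illustration.

```python
def adjust_search_factor(
    presence_prob: float,
    current_factor: float,
    *,
    low: float = 0.3,         # hypothetical: below this, assume the target may have left the crop
    high: float = 0.7,        # hypothetical: above this, assume the target is well localized
    step: float = 1.0,        # hypothetical increment of the search-region scale
    min_factor: float = 4.0,  # hypothetical smallest crop, relative to target size
    max_factor: float = 8.0,  # hypothetical largest crop, relative to target size
) -> float:
    """Return the search-region scale factor to use for the next frame."""
    if not 0.0 <= presence_prob <= 1.0:
        raise ValueError("presence_prob must lie in [0, 1]")
    if presence_prob < low:
        # Low confidence: widen the field of view so a displaced target can be re-acquired.
        return min(current_factor + step, max_factor)
    if presence_prob > high:
        # High confidence: shrink back toward the cheaper local search region.
        return max(current_factor - step, min_factor)
    # Ambiguous confidence: keep the current region unchanged.
    return current_factor
```

Under these assumptions the tracker pays for a larger crop only on frames where the target is likely to have moved out of the local window. This is the efficiency/accuracy trade-off the abstract describes.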

Authors:Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie
Title: $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
Abstract:
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
中文: 我们推出了$\texttt{Complex-Edit}$基准,用于评估基于指令的图像编辑模型在不同复杂度下的表现,揭示了开源与专有模型间的性能差距及合成数据对输出质量的影响等关键发现。
English: We introduce $\texttt{Complex-Edit}$, a benchmark for evaluating instruction-based image editing models across complexity levels, revealing key insights such as performance gaps between open-source and proprietary models and the impact of synthetic data on output quality.
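
Illustrative note: finding 4 refers to Best-of-N selection. The abstract does not say how candidates are generated or scored, so the sketch below shows only the generic technique. `generate` and `score` are hypothetical callables standing in for an editing model and an automatic evaluator.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    n: int,
) -> Tuple[T, float]:
    """Generate n candidates (one per seed) and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates; the seed index stands in for sampling randomness.
    candidates = [generate(seed) for seed in range(n)]
    # Score every candidate once and remember where it came from.
    scored = [(score(candidate), index) for index, candidate in enumerate(candidates)]
    # max() keeps the first maximum, so ties resolve to the lowest seed.
    best_score, best_index = max(scored, key=lambda pair: pair[0])
    return candidates[best_index], best_score


# Usage with toy stand-ins for the editing model and the evaluator (hypothetical values).
toy_scores = {"edit-0": 0.2, "edit-1": 0.9, "edit-2": 0.5}
best, value = best_of_n(lambda seed: f"edit-{seed}", toy_scores.__getitem__, 3)
```

The same routine can in principle wrap either a single direct edit or each step of a sequential pipeline, since it only assumes a generator and a scalar scorer.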

Authors:Mengying Yuan, Wenhao Wang, Zixuan Wang, Yujie Huang, Kangli Wei, Fei Li, Chong Teng, Donghong Ji
Title: Cross-Document Cross-Lingual NLI via RST-Enhanced Graph Fusion and Interpretability Prediction
Abstract:
Natural Language Inference (NLI) is a fundamental task in natural language processing. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm: CDCL-NLI, which extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 25,410 instances and spanning 26 languages. To address the limitations of previous methods on the CDCL-NLI task, we further propose an innovative method that integrates RST-enhanced graph fusion with interpretability-aware prediction. Our approach leverages RST (Rhetorical Structure Theory) within heterogeneous graph neural networks for cross-document context modeling, and employs a structure-aware semantic alignment based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU (Elementary Discourse Unit)-level attribution framework that produces extractive explanations. Extensive experiments demonstrate our approach's superior performance, achieving significant improvements over both conventional NLI models and large language models. Our work sheds light on the study of NLI and is expected to stimulate research interest in cross-document cross-lingual context understanding, hallucination elimination and interpretability inference. Our code and datasets are available at \href{https://anonymous.4open.science/r/CDCL-NLI-637E/}{CDCL-NLI-link} for peer review.
中文摘要:本文提出了跨文档跨语言自然语言推理(CDCL-NLI)的新范式,通过结合RST增强图融合与可解释性预测的方法,并构建多语言数据集,显著提升了跨文档跨语言语境下的推理性能。
English Summary: This paper introduces Cross-Document Cross-Lingual Natural Language Inference (CDCL-NLI), proposing a novel method combining RST-enhanced graph fusion with interpretability-aware prediction and creating a multilingual dataset to advance cross-document, cross-lingual understanding.

Authors:Huizhe Zhang, Jintang Li, Yuchang Zhu, Liang Chen, Zibin Zheng
Title: GT-SVQ: A Linear-Time Graph Transformer for Node Classification Using Spiking Vector Quantization
Abstract:
Graph Transformers (GTs), which simultaneously integrate message-passing and self-attention mechanisms, have achieved promising empirical results in some graph prediction tasks. Although these approaches show the potential of Transformers in capturing long-range graph topology information, issues concerning the quadratic complexity and high computing energy consumption severely limit the scalability of GTs on large-scale graphs. Recently, Spiking Neural Networks (SNNs), as brain-inspired neural networks, have facilitated the development of graph representation learning methods with lower computational and storage overhead through their unique event-driven spiking neurons. Inspired by these characteristics, we propose a linear-time Graph Transformer using Spiking Vector Quantization (GT-SVQ) for node classification. GT-SVQ reconstructs codebooks based on rate coding outputs from spiking neurons, and injects the codebooks into self-attention blocks to aggregate global information in linear complexity. Besides, spiking vector quantization effectively alleviates codebook collapse and the reliance on complex machinery (distance measure, auxiliary loss, etc.) present in previous vector quantization-based graph learning methods. In experiments, we compare GT-SVQ with other state-of-the-art baselines on node classification datasets ranging from small to large. Experimental results show that GT-SVQ achieves competitive performance on most datasets while maintaining up to 130x faster inference speed compared to other GTs.
中文摘要:提出的GT-SVQ模型将脉冲神经网络与图Transformer相结合,通过线性复杂度的向量量化实现高效节点分类,在保持竞争力的分类性能同时显著提升了推理速度。
English Summary: The proposed GT-SVQ model combines spiking neural networks with graph transformers to achieve efficient node classification through linear-complexity vector quantization, delivering competitive performance with significantly faster inference speeds.

Authors:Paul Denny, Viraj Kumar, Stephen MacNeil, James Prather, Juho Leinonen
Title: Probing the Unknown: Exploring Student Interactions with Probeable Problems at Scale in Introductory Programming
Abstract:
Introductory programming courses often rely on small code-writing exercises that have clearly specified problem statements. This limits opportunities for students to practice how to clarify ambiguous requirements -- a critical skill in real-world programming. In addition, the emerging capabilities of large language models (LLMs) to produce code from well-defined specifications may harm student engagement with traditional programming exercises. This study explores the use of ``Probeable Problems'', automatically gradable tasks that have deliberately vague or incomplete specifications. Such problems require students to submit test inputs, or `probes', to clarify requirements before implementation. Through analysis of over 40,000 probes in an introductory course, we identify patterns linking probing behaviors to task success. Systematic strategies, such as thoroughly exploring expected behavior before coding, resulted in fewer incorrect code submissions and correlated with course success. Feedback from nearly 1,000 participants highlighted the challenges and real-world relevance of these tasks, as well as benefits to critical thinking and metacognitive skills. Probeable Problems are easy to set up and deploy at scale, and help students recognize and resolve uncertainties in programming problems.
中文摘要:本研究提出“可探询问题”——一种具有模糊规格的可自动评分编程任务,要求学生通过提交测试探针来澄清需求,在入门课程中有效提升了批判性思维并减少了编码错误。
English Summary: This study introduces "Probeable Problems," gradable programming tasks with vague specifications that require students to submit test probes for clarification, which improved critical thinking and reduced coding errors in an introductory course.

Authors:Guocong Li, Weize Liu, Yihang Wu, Ping Wang, Shuaihan Huang, Hongxia Xu, Jian Wu
Title: From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs
Abstract:
Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information.
中文: 本文提出一种三阶段微调方法,通过训练大语言模型识别、修正输入中的误导信息,显著提升了多个实验数据集上的回答准确率并减少了幻觉生成。
English: This paper introduces a three-stage fine-tuning method that enhances large language models' ability to detect and correct misleading input information, significantly improving response accuracy and reducing hallucinations across multiple experimental datasets.

Authors:Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
Title: ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings
Abstract:
The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
中文:ProtFlow是一种基于快速流匹配的框架,通过利用蛋白质语言模型的压缩潜在空间来增强蛋白质序列设计,实现高效的单步生成,并在多种应用中超越现有方法。
English: ProtFlow is a fast flow matching-based framework that enhances protein sequence design by leveraging compressed latent space from protein language models, enabling efficient single-step generation and outperforming existing methods across various applications.

Authors:Zheyuan Zhang, Monica Dou, Linkai Peng, Hongyi Pan, Ulas Bagci, Boqing Gong
Title: VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro
Abstract:
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.
中文:本文推出了首个针对广告视频的多模态大模型评测数据集VideoAds,实验表明开源模型在视频摘要和推理任务上优于闭源模型,但所有模型性能均远低于人类专家水平,凸显了提升时序建模能力对视频理解的必要性。
English: This paper introduces VideoAds, the first dataset specifically designed to benchmark multimodal large language models on complex advertisement videos, revealing that while open-source models outperform proprietary ones in summarization and reasoning tasks, all models significantly trail human performance, highlighting the need for improved temporal modeling in video understanding.

Authors:Mohammad A. A. K. Jalwana, Naveed Akhtar, Ajmal Mian, Nazanin Rahnavard, Mubarak Shah
Title: On Transfer-based Universal Attacks in Pure Black-box Setting
Abstract:
Despite their impressive performance, deep visual models are susceptible to transferable black-box adversarial attacks. Principally, these attacks craft perturbations in a target model-agnostic manner. However, surprisingly, we find that existing methods in this domain inadvertently take help from various priors that violate the black-box assumption such as the availability of the dataset used to train the target model, and the knowledge of the number of classes in the target model. Consequently, the literature fails to articulate the true potency of transferable black-box attacks. We provide an empirical study of these biases and propose a framework that aids in a prior-free transparent study of this paradigm. Using our framework, we analyze the role of prior knowledge of the target model data and number of classes in attack performance. We also provide several interesting insights based on our analysis, and demonstrate that priors cause overestimation in transferability scores. Finally, we extend our framework to query-based attacks. This extension inspires a novel image-blending technique to prepare data for effective surrogate model training.
中文: 深度视觉模型易受可迁移黑盒对抗攻击,但现有方法依赖违反黑盒假设的先验知识,导致迁移性被高估;本研究提出无先验透明评估框架,并开发图像融合技术以提升代理模型训练效果。
English: Deep visual models are vulnerable to transferable black-box adversarial attacks, but existing methods rely on priors that violate black-box assumptions, leading to overestimated transferability; this study introduces a prior-free framework for transparent evaluation and proposes an image-blending technique for improved surrogate training.

Authors:Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Title: Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Abstract:
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
中文摘要:Video-MSG方法提出了一种无需训练的文本到视频生成引导技术,通过多模态规划和结构化噪声初始化创建精细的视频草图,无需推理时微调或额外内存即可提升文本遵循能力。
English Summary: The Video-MSG method introduces a training-free guidance approach for text-to-video generation that uses multimodal planning and structured noise initialization to create detailed video sketches, enabling improved text alignment without requiring fine-tuning or additional memory during inference.

Authors:Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, Sean Christofferson, James Fort, Xiaqing Pan, Mingfei Yan, Jiajun Wu, Carl Yuheng Ren, Richard Newcombe
Title: Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset
Abstract:
We introduce the Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin-quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. The DTC dataset is already released at https://www.projectaria.com/datasets/dtc/ and we will also make the baseline evaluations open-source.
Chinese: 数字孪生目录(DTC)推出了一个包含2000个扫描数字孪生对象的大规模逼真3D数据集,结合单反相机和AR眼镜采集的图像,为评估和改进3D重建方法建立了首个全面基准。
English: The Digital Twin Catalog (DTC) introduces a large-scale photorealistic 3D object dataset with 2,000 scanned digital twins and imagery from DSLR cameras and AR glasses, establishing the first comprehensive benchmark for evaluating and improving 3D reconstruction methods.

Authors:Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Title: Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Abstract:
We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations available at https://hsp-iit.github.io/embodied-captioning/
Chinese: 本文提出一种自监督框架,通过共识机制提炼伪描述并使用对比学习微调模型,显著提升了智能体在主动探索环境中描述物体的准确性和一致性。
English: This paper introduces a self-supervised framework that enhances object description accuracy and consistency in active exploration by distilling pseudo-captions through consensus and fine-tuning captioning models with contrastive learning.

Authors:Yi Huang, Ke Zhang, Wei Liu, Yuanyuan Wang, Vishal M. Patel, Le Lu, Xu Han, Dakai Jin, Ke Yan
Title: HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
Abstract:
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability.
中文摘要:提出的HarmonySeg框架通过多尺度感受野、血管特征图增强小结构识别以及拓扑保持损失函数处理标注缺陷,在多个数据集上实现了对医学图像中管状结构的精确分割,性能优于现有先进方法。
English Summary: The proposed HarmonySeg framework effectively segments tubular structures in medical images by incorporating multi-scale receptive fields, vesselness maps for enhanced recall, and a topology-preserving loss to handle annotation challenges, outperforming existing methods across multiple datasets.

Authors:Shujin Wu, Cheng Qian, Yi R. Fung, Paul Pu Liang, Heng Ji
Title: Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization
Abstract:
The growing capabilities of large language models (LLMs) present a key challenge of maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students from employing their knowledge during training and reaching their full potential. In this work, we introduce Alice (pro{A}ctive {l}earning w{i}th tea{c}her's D{e}monstrations), a framework that leverages complementary knowledge between teacher and student to enhance the learning process. We probe the knowledge base of the teacher model by eliciting their uncertainty, and then use these insights together with teachers' responses as demonstrations to guide student models in self-generating improved responses for supervision. In addition, for situations with significant capability gaps between teacher and student models, we introduce cascade Alice, which employs a hierarchical training approach where weak teachers initially supervise intermediate models, who then guide stronger models in sequence. Experimental results demonstrate that our method significantly enhances the W2SG performance, yielding substantial improvements in three key tasks compared to the original W2SG: knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%). This highlights the effectiveness of our new W2SG paradigm that enables more robust knowledge transfer and supervision outcome.
中文: Alice是一个主动学习框架,通过利用教师模型的不确定性和示范来指导学生自我生成改进的响应,从而增强弱到强泛化能力,显著提升了推理任务的性能。
English: Alice is a proactive learning framework that enhances weak-to-strong generalization by leveraging teacher uncertainty and demonstrations to guide students in self-generating improved responses, significantly boosting performance in reasoning tasks.

Authors:Gustavo Moreira, Edyta Paulina Bogucka, Marios Constantinides, Daniele Quercia
Title: The Hall of AI Fears and Hopes: Comparing the Views of AI Influencers and those of Members of the U.S. Public Through an Interactive Platform
Abstract:
AI development is shaped by academics and industry leaders - let us call them ``influencers'' - but it is unclear how their views align with those of the public. To address this gap, we developed an interactive platform that served as a data collection tool for exploring public views on AI, including their fears, hopes, and overall sense of hopefulness. We made the platform available to 330 participants representative of the U.S. population in terms of age, sex, ethnicity, and political leaning, and compared their views with those of 100 AI influencers identified by Time magazine. The public fears AI getting out of control, while influencers emphasize regulation, seemingly to deflect attention from their alleged focus on monetizing AI's potential. Interestingly, the views of AI influencers from underrepresented groups such as women and people of color often differ from the views of underrepresented groups in the public.
中文: 研究发现AI领域意见领袖与公众观点存在差异,公众担忧AI失控而意见领袖强调监管,且不同群体内部观点也存在显著分歧。
English: This study reveals a disconnect between AI influencers and the public, with the public fearing AI's uncontrollability while influencers focus on regulation, and highlights differing views within underrepresented groups.

Authors:Bing Han, Feifei Zhao, Yinqian Sun, Wenxuan Pan, Yi Zeng
Title: Continual Learning of Multiple Cognitive Functions with Brain-inspired Temporal Development Mechanism
Abstract:
Cognitive functions in current artificial intelligence networks are tied to the exponential increase in network scale, whereas the human brain can continuously learn hundreds of cognitive functions with remarkably low energy consumption. This advantage is in part due to the brain's cross-regional temporal development mechanisms, where the progressive formation, reorganization, and pruning of connections from basic to advanced regions facilitate knowledge transfer and prevent network redundancy. Inspired by these, we propose the Continual Learning of Multiple Cognitive Functions with Brain-inspired Temporal Development Mechanism (TD-MCL), enabling cognitive enhancement from simple to complex in Perception-Motor-Interaction (PMI) multiple cognitive task scenarios. The TD-MCL model proposes the sequential evolution of long-range connections between different cognitive modules to promote positive knowledge transfer, while using feedback-guided local connection inhibition and pruning to effectively eliminate redundancies in previous tasks, reducing energy consumption while preserving acquired knowledge. Experiments show that the proposed method can achieve continual learning capabilities while reducing network scale, without introducing regularization, replay, or freezing strategies, and achieves superior accuracy on new tasks compared to direct learning. The proposed method shows that the brain's developmental mechanisms offer a valuable reference for exploring biologically plausible, low-energy enhancements of general cognitive abilities.
中文摘要:提出的TD-MCL模型模拟大脑时序发育机制,通过跨模块长连接演化和局部连接剪枝,在多认知任务中实现持续学习,在降低网络规模与能耗的同时获得更优性能。
English Summary: The proposed TD-MCL model mimics the brain's temporal development mechanisms to enable continual learning across multiple cognitive tasks while reducing network redundancy and energy consumption, achieving superior performance without traditional regularization methods.

Authors:Yueyang Liu, Jiangxia Cao, Shen Wang, Shuang Wen, Xiang Chen, Xiangyu Wu, Shuang Yang, Zhaojie Liu, Kun Gai, Guorui Zhou
Title: LLM-Alignment Live-Streaming Recommendation
Abstract:
In recent years, integrated short-video and live-streaming platforms have gained massive global adoption, offering dynamic content creation and consumption. Unlike pre-recorded short videos, live-streaming enables real-time interaction between authors and users, fostering deeper engagement. However, this dynamic nature introduces a critical challenge for recommendation systems (RecSys): the same live stream offers vastly different experiences depending on when a user is watching. To optimize recommendations, a RecSys must accurately interpret the real-time semantics of live content and align them with user preferences.
中文: 随着短视频与直播融合平台的兴起,推荐系统面临新挑战:必须准确解析直播内容的实时语义,并根据用户偏好进行匹配,因为同一直播在不同观看时段会带来截然不同的体验。
English: The rise of integrated short-video and live-streaming platforms presents a unique challenge for recommendation systems, requiring them to interpret real-time content semantics and match them with user preferences due to the varying experiences of live streams at different viewing times.

Authors:Jie Wang, Tingfa Xu, Lihe Ding, Xinjie Zhang, Long Bai, Jianan Li
Title: PvNeXt: Rethinking Network Design and Temporal Motion for Point Cloud Video Recognition
Abstract:
Point cloud video perception has become an essential task in the realm of 3D vision. Current 4D representation learning techniques typically engage in iterative processing coupled with dense query operations. Although effective in capturing temporal features, this approach leads to substantial computational redundancy. In this work, we propose a framework, named PvNeXt, for effective yet efficient point cloud video recognition via a personalized one-shot query operation. Specifically, PvNeXt consists of two key modules, the Motion Imitator and the Single-Step Motion Encoder. The former module, the Motion Imitator, is designed to capture the temporal dynamics inherent in sequences of point clouds, thus generating the virtual motion corresponding to each frame. The Single-Step Motion Encoder performs a one-step query operation, associating the point cloud of each frame with its corresponding virtual motion frame, thereby extracting motion cues from point cloud sequences and capturing temporal dynamics across the entire sequence. Through the integration of these two modules, PvNeXt enables personalized one-shot queries for each frame, effectively eliminating the need for frame-specific looping and intensive query processes. Extensive experiments on multiple benchmarks demonstrate the effectiveness of our method.
中文: PvNeXt通过运动模拟器和单步运动编码器模块实现个性化单次查询,有效捕捉点云视频时序特征,避免了迭代处理带来的计算冗余。
English: PvNeXt introduces a personalized one-shot query framework using Motion Imitator and Single-Step Motion Encoder modules to efficiently capture temporal dynamics in point cloud videos, eliminating computational redundancy from iterative processing.

Authors:Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Title: Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Abstract:
Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts--a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.
中文摘要:当前对齐方法仅在大型语言模型中建立表面安全,预训练阶段嵌入的有害知识依然存在,并能通过分布偏移下的对抗性提示完全激活,揭示了模型的普遍脆弱性。
English Summary: Current alignment methods only create superficial safety in large language models, as harmful knowledge from pretraining persists and can be fully reactivated through adversarial prompts under distributional shifts.

Authors:Sheng Zheng, Chaoning Zhang, Dongshen Han, Fachrina Dewi Puspitasari, Xinhong Hao, Yang Yang, Heng Tao Shen
Title: Exploring Kernel Transformations for Implicit Neural Representations
Abstract:
Implicit neural representations (INRs), which leverage neural networks to represent signals by mapping coordinates to their corresponding attributes, have garnered significant attention. They are extensively utilized for image representation, with pixel coordinates as input and pixel values as output. In contrast to prior works focusing on investigating the effect of the model's inside components (activation function, for instance), this work pioneers the exploration of the effect of kernel transformation of input/output while keeping the model itself unchanged. A byproduct of our findings is a simple yet effective method that combines scale and shift to significantly boost INR with negligible computation overhead. Moreover, we present two perspectives, depth and normalization, to interpret the performance benefits caused by scale and shift transformation. Overall, our work provides a new avenue for future works to understand and improve INR through the lens of kernel transformation.
Chinese Summary: 本研究通过引入输入和输出的核变换,提出了一种提升隐式神经表示性能的新方法,该方法在计算开销极低的情况下显著改善效果,并提供了深度和归一化两个视角来解释性能提升的原因。
English Summary: This study introduces a novel approach to enhancing implicit neural representations (INRs) by applying kernel transformations to inputs and outputs, which significantly improves performance with minimal computational cost, while also offering new perspectives through depth and normalization for understanding these gains.

Authors:Ruoyan Li, Zijie Huang, Haixin Wang, Guancheng Wan, Yizhou Sun, Wei Wang
Title: Self-Guided Diffusion Model for Accelerating Computational Fluid Dynamics
Abstract:
Machine learning methods, such as diffusion models, are widely explored as a promising way to accelerate high-fidelity fluid dynamics computation via a super-resolution process from faster-to-compute low-fidelity input. However, existing approaches usually make the impractical assumption that the low-fidelity data is down-sampled from high-fidelity data. In reality, low-fidelity data is produced by numerical solvers that use a coarser resolution. Solver-generated low-fidelity data usually sacrifices fine-grained details, such as small-scale vortices, compared to high-fidelity data. Our findings show that SOTA diffusion models struggle to reconstruct fine-scale details when faced with solver-generated low-fidelity inputs. To bridge this gap, we propose SG-Diff, a novel diffusion model for reconstruction, where both low-fidelity inputs and high-fidelity targets are generated from numerical solvers. We propose an \textit{Importance Weight} strategy during training that serves as a form of self-guidance, focusing on intricate fluid details, and a \textit{Predictor-Corrector-Advancer} SDE solver that embeds physical guidance into the diffusion sampling process. Together, these techniques steer the diffusion model toward more accurate reconstructions. Experimental results on four 2D turbulent flow datasets demonstrate the efficacy of SG-Diff against state-of-the-art baselines.
中文:扩散模型难以从求解器生成的低精度输入中重建精细流体细节,但提出的SG-Diff模型通过重要性加权策略和采样过程中的物理引导,实现了精确的超分辨率重建。
English: Diffusion models struggle to reconstruct fine fluid details from solver-generated low-fidelity inputs, but the proposed SG-Diff model overcomes this with an importance weight strategy and physical guidance during sampling for accurate super-resolution reconstruction.

Authors:Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks
Title: One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Abstract:
Multi-modal retrieval augmented generation (M-RAG) is instrumental for inhibiting hallucinations in large multi-modal models (LMMs) through the use of a factual knowledge base (KB). However, M-RAG introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this paper, we present the first poisoning attack against M-RAG targeting visual document retrieval applications where the KB contains images of document pages. We propose two attacks, each of which requires injecting only a single adversarial image into the KB. Firstly, we propose a universal attack that, for any potential user query, influences the response to cause a denial-of-service (DoS) in the M-RAG system. Secondly, we present a targeted attack against one or a group of user queries, with the goal of spreading targeted misinformation. For both attacks, we use a multi-objective gradient-based adversarial approach to craft the injected image while optimizing for both retrieval and generation. We evaluate our attacks against several visual document retrieval datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (LMMs), demonstrating the attack effectiveness in both the universal and targeted settings. We additionally present results including commonly used defenses, various attack hyper-parameter settings, ablations, and attack transferability.
Chinese: 本文首次提出针对多模态检索增强生成(M-RAG)系统的投毒攻击,通过向知识库注入单个对抗图像,实现在视觉文档检索应用中引发服务拒绝或传播定向虚假信息。
English: This paper introduces the first poisoning attacks against multi-modal retrieval augmented generation (M-RAG) systems, demonstrating how a single adversarial image injection can cause denial-of-service or spread targeted misinformation by exploiting vulnerabilities in visual document retrieval applications.

Authors:Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, Tomas Pfister
Title: On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Abstract:
Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low. A promising solution is inference-time alignment, which uses extra compute at test time to improve performance. Inference-time alignment relies on three components: sampling, evaluation, and feedback. While most prior work studies sampling and automatic evaluation, feedback remains underexplored. To study the role of feedback, we introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques (reward models or AI-generated textual feedback) between decoding steps. Through IAD, we analyze feedback along four dimensions: (1) its role in the accuracy-compute trade-offs with limited inference budget, (2) quantifying the gains over diversity-only baselines such as best-of-N sampling, (3) effectiveness of composing feedback from reward models versus textual critique, and (4) robustness to noisy or low-quality feedback. Across Sketch2Code, Text2SQL, Intercode, and WebShop, we show that IAD with proper integration of high fidelity feedback leads to consistent gains up to 10 percent absolute performance improvement over various baselines such as best-of-N. Our findings underscore feedback as a crucial knob for inference-time alignment of agentic AI workflows with limited inference budget.
中文摘要:通过迭代代理解码(IAD)在推理阶段整合高质量反馈,可使自主AI工作流在有限计算资源下实现最高10%的性能提升,证明反馈机制是优化推理时对齐效果的关键要素。
English Summary: Inference-time alignment through Iterative Agent Decoding (IAD) demonstrates that integrating high-quality feedback between decoding steps consistently improves agentic AI performance by up to 10% across diverse tasks, establishing feedback as a critical component for optimizing computational efficiency.

Authors:Yuejiao Su, Yi Wang, Qiongyang Hu, Chuang Yang, Lap-Pui Chau
Title: ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Abstract:
Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, which lacks flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of our ANNEXE model compared with other works.
中文摘要:本文提出Ego-IRG任务,通过文本和像素级响应全面理解自我中心交互,并创建Ego-IRGBench数据集及ANNEXE模型,实验证明其优于现有方法。
English Summary: This paper introduces the Ego-IRG task for comprehensive egocentric interaction understanding through textual and pixel-level responses, supported by the novel Ego-IRGBench dataset and ANNEXE model that outperforms existing methods.

Authors:Aman Sharma, Benoit Baudry, Martin Monperrus
Title: Causes and Canonicalization for Unreproducible Builds in Java
Abstract:
The increasing complexity of software supply chains and the rise of supply chain attacks have elevated concerns around software integrity. Users and stakeholders face significant challenges in validating that a given software artifact corresponds to its declared source. Reproducible Builds address this challenge by ensuring that independently performed builds from identical source code produce identical binaries. However, achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central, and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using OSS-Rebuild and jNorm respectively. Finally, we present Chains-Rebuild, a tool that achieves successful canonicalization for 26.60% of 12,803 unreproducible artifacts. To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and Chains-Rebuild, a canonicalization tool for mitigating unreproducible builds in Java.
中文摘要:本研究针对Java软件中实现可重复构建的挑战,提出了概念框架、分析了不可重复性的根本原因,并开发了规范化工具,成功解决了26.60%不可重复构件的问题。
English Summary: This study addresses the challenge of achieving reproducible builds in Java software by introducing a conceptual framework, analyzing root causes of unreproducibility, and developing canonicalization tools that successfully mitigate 26.60% of unreproducible artifacts.
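The idea of artifact canonicalization can be illustrated with a small sketch. It is not the behavior of OSS-Rebuild, jNorm, or Chains-Rebuild; it only shows one generic normalization (sorted entry order, fixed timestamps) applied to ZIP-format archives such as JARs. The helper names `canonicalize_archive`, `digest`, and `make_jar`, and the sample entries, are assumptions made for this illustration.

```python
import hashlib
import io
import zipfile

# Earliest timestamp the ZIP format can store; used as an arbitrary fixed value.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)

def canonicalize_archive(data: bytes) -> bytes:
    """Rewrite a ZIP/JAR with entries sorted by name and a fixed timestamp."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for name in sorted(src.namelist()):
            # A fresh ZipInfo drops the original timestamp and attributes;
            # entries are stored uncompressed (the ZipInfo default).
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            dst.writestr(info, src.read(name))
    return out.getvalue()

def digest(data: bytes) -> str:
    """SHA-256 hex digest used to compare two artifacts."""
    return hashlib.sha256(data).hexdigest()

def make_jar(entries, timestamp):
    """Build an in-memory archive (hypothetical stand-in for a build output)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in entries:
            zf.writestr(zipfile.ZipInfo(name, date_time=timestamp), payload)
    return buf.getvalue()

# Two "builds" with the same content but different entry order and timestamps.
build_a = make_jar(
    [("A.class", b"\xca\xfe"), ("META-INF/MANIFEST.MF", b"v1")],
    (2024, 1, 1, 0, 0, 0),
)
build_b = make_jar(
    [("META-INF/MANIFEST.MF", b"v1"), ("A.class", b"\xca\xfe")],
    (2025, 6, 2, 12, 0, 0),
)

print(digest(build_a) == digest(build_b))
print(digest(canonicalize_archive(build_a)) == digest(canonicalize_archive(build_b)))
```

The raw archives differ byte for byte, while their canonical forms hash identically. Differences inside the entries themselves, such as bytecode variations, would need a separate normalization step.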

Authors:Paiheng Xu, Gang Wu, Xiang Chen, Tong Yu, Chang Xiao, Franck Dernoncourt, Tianyi Zhou, Wei Ai, Viswanathan Swaminathan
Title: Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
Abstract:
Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset's diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.
中文: 该离线模拟框架利用大语言模型和脚本指南,通过任务生成和技能验证创建经过验证的软件专用脚本,相比运行时代码生成显著提高了自动化成功率,同时降低了响应时间和计算成本。
English: The proposed offline simulation framework leverages LLMs and scripting guides to create verified software-specific scripts through task generation and skill validation, significantly enhancing automation success while reducing response time and computational costs compared to runtime code generation.

Authors:Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum
Title: PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Abstract:
The autoregressive nature of large language models (LLMs) limits inference speed. Each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding alleviates this issue using a draft-then-verify approach to accelerate token generation. However, the overhead introduced during the draft phase and the training cost of the draft model limit the efficiency and adaptability of speculative decoding. In this work, we introduce PARallel Draft (PARD), a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models. PARD enhances inference efficiency by predicting multiple future tokens in a single forward pass of the draft phase, and incorporates a conditional drop token method to accelerate training. Its target-independence property allows a single draft model to be applied to an entire family of different models, minimizing the adaptation cost. Our proposed conditional drop token method improves draft model training efficiency by 3x. On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.
中文摘要:PARD提出了一种新颖的推测解码方法,将自回归草稿模型转换为并行模型,通过单次前向预测多个令牌并采用条件丢弃令牌技术,大幅提升了推理速度和训练效率。
English Summary: PARD introduces a speculative decoding method that transforms autoregressive draft models into parallel ones, enabling multiple token predictions per forward pass and significantly boosting inference speed and training efficiency.
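The draft-then-verify loop mentioned in the abstract can be illustrated with a minimal greedy sketch. This is not PARD's implementation: `speculative_step` and the toy `target_next` function are hypothetical names, the drafted tokens are passed in as a plain list standing in for the draft model's output, and verification is written sequentially for readability, whereas practical systems score all drafted positions in a single target forward pass.

```python
def speculative_step(prefix, draft_tokens, target_next):
    """Verify drafted tokens against the target model's greedy choices.

    Drafted tokens are accepted while they match what the target model would
    emit. At the first mismatch the target's token is used instead and the
    step ends. If every drafted token is accepted, one extra target token is
    appended, so each step yields at least one token.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the draft and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted

def target_next(tokens):
    """Toy target model (an assumption): always continues by counting up."""
    return tokens[-1] + 1

# The draft gets two tokens right, then diverges at the third position.
print(speculative_step([1, 2], [3, 4, 9, 10], target_next))  # [3, 4, 5]
```

The more drafted tokens match, the more tokens are produced per verification step, which is why a cheaper and more accurate draft phase translates into higher throughput.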

Authors:Zijing Shi, Meng Fang, Ling Chen
Title: Monte Carlo Planning with Large Language Model for Text-Based Game Agents
Abstract:
Text-based games provide valuable environments for language-based autonomous agents. However, planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) and reinforcement learning (RL), are notably time-consuming due to extensive iterations. Additionally, these algorithms perform uncertainty-driven exploration but lack language understanding and reasoning abilities. In this paper, we introduce the Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML) algorithm. MC-DML leverages the language understanding and reasoning capabilities of Large Language Models (LLMs) alongside the exploratory advantages of tree search algorithms. Specifically, we enhance LLMs with in-trial and cross-trial memory mechanisms, enabling them to learn from past experiences and dynamically adjust action evaluations during planning. We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. This demonstrates the effectiveness of our algorithm, paving the way for more efficient language-grounded planning in complex environments.
中文: MC-DML算法结合大语言模型的推理能力与树搜索,通过动态记忆机制提升规划效率,在文本游戏中显著优于现有方法。
English: The MC-DML algorithm integrates large language models' reasoning with tree search, using dynamic memory to enhance planning efficiency and outperform existing methods in text-based games.

Authors:Xuhui Zhou, Zhe Su, Sophie Feng, Jiaxu Zhou, Jen-tse Huang, Hsien-Te Kao, Spencer Lynch, Svitlana Volkova, Tongshuang Sherry Wu, Anita Woolley, Hao Zhu, Maarten Sap
Title: SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation
Abstract:
Social simulation through large language model (LLM) agents is a promising approach to explore and validate hypotheses related to social science questions and LLM agent behavior. We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate multi-turn and multi-party LLM-based interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation and multi-party planning scenarios.
中文:SOTOPIA-S4是一个快速、灵活且可扩展的社会模拟系统,通过LLM代理实现可定制的多轮多方交互,用于社会科学假设检验,提供包含模拟引擎、API服务器和网页界面的pip包,方便技术与非技术用户使用。
English: SOTOPIA-S4 is a fast and scalable social simulation system that enables customizable multi-turn, multi-party interactions using LLM agents for hypothesis testing in social science, accessible via a pip package with an engine, API server, and web interface for both technical and non-technical users.

Authors:Qirui Yang, Fangpu Zhang, Yeying Jin, Qihua Cheng, Peng-Tao Jiang, Huanjing Yue, Jingyu Yang
Title: DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy
Abstract:
With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation and achieves an inference speed 2.4x faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at https://xxxxxxxxdsdnet.github.io/DSDNet/.
Chinese: 本文提出的DSDNet单阶段原始域去摩尔纹框架,在有效消除摩尔纹的同时保持了色彩保真度,并实现了比现有方法显著更快的推理速度。
English: This paper introduces DSDNet, a single-stage raw domain framework that effectively removes moiré artifacts while maintaining color fidelity and achieving significantly faster inference speeds than existing methods.

Authors:Xiucheng Wang, Qiming Zhang, Nan Cheng, Ruijin Sun, Zan Li, Shuguang Cui, Xuemin Shen
Title: RadioDiff-$k^2$: Helmholtz Equation Informed Generative Diffusion Model for Multi-Path Aware Radio Map Construction
Abstract:
In this paper, we propose a novel physics-informed generative learning approach, termed RadioDiff-$k^2$, for accurate and efficient multipath-aware radio map (RM) construction. As wireless communication evolves towards environment-aware paradigms, driven by the increasing demand for intelligent and proactive optimization in sixth-generation (6G) networks, accurate construction of RMs becomes crucial yet highly challenging. Conventional electromagnetic (EM)-based methods, such as full-wave solvers and ray-tracing approaches, exhibit substantial computational overhead and limited adaptability to dynamic scenarios. Although existing neural network (NN) approaches offer fast inference, they lack sufficient consideration of the underlying physics of EM wave propagation, limiting their effectiveness in accurately modeling critical EM singularities induced by complex multipath environments. To address these fundamental limitations, we propose a novel physics-inspired RM construction method guided explicitly by the Helmholtz equation, which inherently governs EM wave propagation. Specifically, we theoretically establish a direct correspondence between EM singularities, which correspond to the critical spatial features influencing wireless propagation, and regions defined by negative wave numbers in the Helmholtz equation. Based on this insight, we design an innovative dual generative diffusion model (DM) framework comprising one DM dedicated to accurately inferring EM singularities and another DM responsible for reconstructing the complete RM using these singularities along with environmental contextual information. Our physics-informed approach uniquely combines the efficiency advantages of data-driven methods with rigorous physics-based EM modeling, significantly enhancing RM accuracy, particularly in complex propagation environments dominated by multipath effects.
Chinese: 本文提出RadioDiff-k²这一物理信息生成学习方法,通过基于亥姆霍兹方程的双扩散模型,在复杂多径环境中精确建模电磁奇点,从而实现高精度无线电地图构建。
English: This paper introduces RadioDiff-k², a physics-informed generative learning method that uses dual diffusion models guided by the Helmholtz equation to accurately construct radio maps by modeling electromagnetic singularities in complex multipath environments.

Authors:Victoria Marie Tuck, Hardik Parwana, Pei-Wei Chen, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, S. Shankar Sastry, Sanjit A. Seshia
Title: MRTA-Sim: A Modular Simulator for Multi-Robot Allocation, Planning, and Control in Open-World Environments
Abstract:
This paper introduces MRTA-Sim, a Python/ROS2/Gazebo simulator for testing approaches to Multi-Robot Task Allocation (MRTA) problems on simulated robots in complex, indoor environments. Grid-based approaches to MRTA problems can be too restrictive for use in complex, dynamic environments such as warehouses, department stores, hospitals, etc. However, approaches that operate in free space often operate at a layer of abstraction above the control and planning layers of a robot and make an assumption on approximate travel time between points of interest in the system. These abstractions can neglect the impact of tight spaces and multi-agent interactions on the quality of the solution. Therefore, MRTA solutions should be tested with the navigation stacks of the robots in mind, taking into account robot planning, conflict avoidance between robots, and human interaction and avoidance. This tool connects the allocation output of MRTA solvers to individual robot planning using the NAV2 stack and local, centralized multi-robot deconfliction using Control Barrier Function-Quadratic Programs (CBF-QPs), creating a platform closer to real-world operation for more comprehensive testing of these approaches. The simulation architecture is modular so that users can swap out methods at different levels of the stack. We show the use of our system with a Satisfiability Modulo Theories (SMT)-based approach to dynamic MRTA on a fleet of indoor delivery robots.
Chinese: 本文介绍了MRTA-Sim,一个基于Python/ROS2/Gazebo的仿真平台,通过将多机器人任务分配求解器与机器人导航系统相结合,在复杂室内环境中测试任务分配方案,同时考虑机器人路径规划和冲突避免等实际约束。
English: This paper presents MRTA-Sim, a Python/ROS2/Gazebo simulator that integrates MRTA solvers with robot navigation stacks to test multi-robot task allocation in complex indoor environments while accounting for real-world constraints like robot planning and conflict avoidance.

Authors:Chen Zhao, Anjum Shaik, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou
Title: ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images
Abstract:
Hip fractures represent a major health concern, particularly among the elderly, often leading to decreased mobility and increased mortality. Early and accurate detection of at-risk individuals is crucial for effective intervention. In this study, we propose Iterative Cross Graph Matching for Hip Fracture Risk Assessment (ICGM-FRAX), a novel approach for predicting hip fractures using Dual-energy X-ray Absorptiometry (DXA) images. ICGM-FRAX involves iteratively comparing a test (subject) graph with multiple template graphs representing the characteristics of hip fracture subjects to assess their similarity and accurately predict hip fracture risk. These graphs are obtained as follows. The DXA images are separated into multiple regions of interest (RoIs), such as the femoral head, shaft, and lesser trochanter. Radiomic features are then calculated for each RoI, with the central coordinates used as nodes in a graph. The connectivity between nodes is established according to the Euclidean distance between these coordinates. This process transforms each DXA image into a graph, where each node represents an RoI, and edges derived from the centroids of RoIs capture the spatial relationships between them. If the test graph closely matches a set of template graphs representing subjects with incident hip fractures, it is classified as indicating high hip fracture risk. We evaluated our method using 547 subjects from the UK Biobank dataset, and experimental results show that ICGM-FRAX achieved a sensitivity of 0.9869, demonstrating high accuracy in predicting hip fractures.
中文: 本研究提出ICGM-FRAX方法,通过迭代比对DXA图像生成的骨骼区域特征图来预测髋部骨折风险,在英国生物银行数据验证中展现出高灵敏度。
English: The study introduces ICGM-FRAX, a novel method using iterative graph matching of DXA image features to accurately predict hip fracture risk, achieving high sensitivity in validation with the UK Biobank dataset.
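The graph construction described in the abstract (RoI centroids as nodes, connectivity from Euclidean distance) might look roughly like the sketch below. The abstract does not state the exact connectivity rule, so the distance threshold `max_distance`, the function name `build_roi_graph`, and the sample centroid coordinates are all assumptions made for illustration; radiomic node features and the iterative matching step are omitted.

```python
import math
from itertools import combinations

def build_roi_graph(centroids, max_distance):
    """Link every pair of RoIs whose centroids lie within max_distance.

    centroids maps an RoI name to its (x, y) centre. The result is an
    undirected graph stored as a dict of neighbour sets.
    """
    graph = {name: set() for name in centroids}
    for a, b in combinations(centroids, 2):
        if math.dist(centroids[a], centroids[b]) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical centroids in pixel coordinates (not taken from the paper).
rois = {
    "femoral_head": (40.0, 30.0),
    "lesser_trochanter": (55.0, 70.0),
    "shaft": (60.0, 120.0),
}
graph = build_roi_graph(rois, max_distance=60.0)
print(sorted(graph["lesser_trochanter"]))  # ['femoral_head', 'shaft']
```

With this threshold the femoral head and the shaft are too far apart to be linked directly, so the lesser trochanter is the only node connected to both.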

Authors:Donghyeong Kim, Chaewon Park, Suhwan Cho, Hyeonjeong Lim, Minseok Kang, Jungho Lee, Sangyoun Lee
Title: GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection
Abstract:
Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.
中文: GenCLIP提出了一种创新框架,通过多层提示和双分支推理有效学习通用提示,结合自适应文本过滤机制,提升了零样本异常检测的泛化能力和类别特异性,确保视觉-语言对齐的稳定性。
English: GenCLIP introduces a novel framework that enhances zero-shot anomaly detection by employing multi-layer prompting and dual-branch inference to learn general prompts effectively, improving both generalization and specificity while incorporating adaptive text filtering for robust vision-language alignment.
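
The abstract builds on the usual CLIP zero-shot recipe of matching text prompts against visual features. As a rough illustration of that baseline idea only (not GenCLIP's multi-layer prompting or dual-branch inference), the sketch below scores one visual feature against a "normal" and an "abnormal" prompt embedding. The function names, the toy 2-D vectors, and the temperature of 0.07 are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_score(visual_feat, normal_text, abnormal_text, temperature=0.07):
    """Softmax over the two prompt similarities; returns P(abnormal).

    Hypothetical helper: the embeddings stand in for CLIP image/text features.
    """
    s_normal = cosine(visual_feat, normal_text) / temperature
    s_abnormal = cosine(visual_feat, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)

# Toy embeddings: the first axis means "normal", the second "abnormal".
print(anomaly_score([0.9, 0.1], [1.0, 0.0], [0.0, 1.0]))
print(anomaly_score([0.1, 0.9], [1.0, 0.0], [0.0, 1.0]))
```

A feature closer to the "abnormal" prompt receives a score above 0.5, and one closer to the "normal" prompt a score below 0.5.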

Authors:Songtao Jiang, Yuan Wang, Sibo Song, Yan Zhang, Zijie Meng, Bohan Lei, Jian Wu, Jimeng Sun, Zuozhu Liu
Title: OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Abstract:
The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60\% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.
中文: 针对现有医学视觉语言模型采用不同模态分离编码器的局限,OmniV-Med提出统一框架,通过构建大规模多模态数据集、设计旋转位置自适应编码器和医疗感知令牌剪枝机制,在多种医学影像和视频任务中实现最优性能,同时支持高效训练与长视频推理。
English: To overcome the limitations of separate encoders for different medical visual modalities, OmniV-Med introduces a unified framework with a comprehensive multimodal dataset, a rotary position-adaptive encoder, and a medical-aware token pruning mechanism, achieving state-of-the-art performance on various benchmarks while enabling efficient training and inference.

Authors:Yaodan Xu, Sheng Zhou, Zhisheng Niu
Title: Joint Optimization of Offloading, Batching and DVFS for Multiuser Co-Inference
Abstract:
With the growing integration of artificial intelligence in mobile applications, a substantial number of deep neural network (DNN) inference requests are generated daily by mobile devices. Serving these requests presents significant challenges due to limited device resources and strict latency requirements. Therefore, edge-device co-inference has emerged as an effective paradigm to address these issues. In this study, we focus on a scenario where multiple mobile devices offload inference tasks to an edge server equipped with a graphics processing unit (GPU). For finer control over offloading and scheduling, inference tasks are partitioned into smaller sub-tasks. Additionally, GPU batch processing is employed to boost throughput and improve energy efficiency. This work investigates the problem of minimizing total energy consumption while meeting hard latency constraints. We propose a low-complexity Joint DVFS, Offloading, and Batching strategy (J-DOB) to solve this problem. The effectiveness of the proposed algorithm is validated through extensive experiments across varying user numbers and deadline constraints. Results show that J-DOB can reduce energy consumption by up to 51.30% and 45.27% under identical and different deadlines, respectively, compared to local computing.
中文: 本研究提出J-DOB低复杂度策略,通过联合优化动态电压频率调节、任务卸载与GPU批处理,在满足严格延迟要求的前提下,显著降低了移动设备在边缘协同推理场景中的能耗。
English: This study introduces J-DOB, a low-complexity strategy that jointly optimizes dynamic voltage and frequency scaling, task offloading, and GPU batching to minimize mobile devices' energy consumption while meeting strict latency requirements in edge-device co-inference scenarios.

Authors:Chongye Guo, Jinhu Fu, Junfeng Fang, Kun Wang, Guorui Feng
Title: REDEditing: Relationship-Driven Precise Backdoor Poisoning on Text-to-Image Diffusion Models
Abstract:
The rapid advancement of generative AI highlights the importance of text-to-image (T2I) security, particularly with the threat of backdoor poisoning. Timely disclosure and mitigation of security vulnerabilities in T2I models are crucial for ensuring the safe deployment of generative models. We explore a novel training-free backdoor poisoning paradigm through model editing, which has recently been employed for knowledge updating in large language models. In doing so, we reveal the potential security risks posed by model editing techniques to image generation models. In this work, we establish the principles for backdoor attacks based on model editing, and propose a relationship-driven precise backdoor poisoning method, REDEditing. Drawing on the principles of equivalent-attribute alignment and stealthy poisoning, we develop an equivalent relationship retrieval and joint-attribute transfer approach that ensures consistent backdoor image generation through concept rebinding. A knowledge isolation constraint is proposed to preserve benign generation integrity. Our method achieves an 11\% higher attack success rate compared to state-of-the-art approaches. Remarkably, adding just one line of code enhances output naturalness while improving backdoor stealthiness by 24\%. This work aims to heighten awareness regarding this security vulnerability in editable image generation models.
中文: 本研究提出REDEditing方法,通过模型编辑技术对文本到图像模型实施无需训练的后门投毒攻击,在提升攻击成功率的同时增强了隐蔽性,揭示了可编辑生成式AI系统中存在的重大安全风险。
English: This study introduces REDEditing, a training-free backdoor poisoning method for text-to-image models that achieves higher attack success rates and improved stealth through model editing techniques, highlighting critical security vulnerabilities in editable generative AI systems.

Authors:Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux
Title: Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
Abstract:
This report details MERL's system for room impulse response (RIR) estimation submitted to the Generative Data Augmentation Workshop at ICASSP 2025 for Augmenting RIR Data (Task 1) and Improving Speaker Distance Estimation (Task 2). We first pre-train a neural acoustic field conditioned by room geometry on an external large-scale dataset in which pairs of RIRs and the geometries are provided. The neural acoustic field is then adapted to each target room by using the enrollment data, where we leverage either the provided room geometries or geometries retrieved from the external dataset, depending on availability. Lastly, we predict the RIRs for each pair of source and receiver locations specified by Task 1, and use these RIRs to train the speaker distance estimation model in Task 2.
中文: 本报告介绍了MERL的室内脉冲响应估计系统,该系统基于外部数据集预训练神经网络声场,利用注册数据将其适配到目标房间,并预测指定声源-接收器对的脉冲响应,从而训练说话人距离估计模型。
English: This report presents MERL's system for room impulse response (RIR) estimation, which involves pre-training a neural acoustic field with room geometry on an external dataset, adapting it to target rooms using enrollment data, and then predicting RIRs for specified source-receiver pairs to train a speaker distance estimation model.

Authors:Marcello Bullo, Amir Ashtari Gargari, Paolo Testolina, Michele Zorzi, Marco Giordani
Title: Statistical Analysis and End-to-End Performance Evaluation of Traffic Models for Automotive Data
Abstract:
Autonomous driving is a major paradigm shift in transportation, with the potential to enhance safety, reduce traffic congestion, and reduce fuel consumption. Although autonomous vehicles rely on advanced sensors and on-board computing systems to navigate without human control, full awareness of the driving environment also requires a cooperative effort via Vehicle-To-Everything (V2X) communication. Specifically, vehicles send and receive sensor perceptions to/from other vehicles to extend perception beyond their own sensing range. However, transmitting large volumes of data can be challenging for current V2X communication technologies, so data compression represents a crucial solution to reduce the message size and link congestion. In this paper, we present a statistical characterization of automotive data, focusing on LiDAR sensors. Notably, we provide models for the size of both raw and compressed point clouds. The use of statistical traffic models offers several advantages compared to using real data, such as faster simulations, reduced storage requirements, and greater flexibility in the application design. Furthermore, statistical models can be used for understanding traffic patterns and analyzing statistics, which is crucial to design and optimize wireless networks. We validate our statistical models via a Kolmogorov-Smirnov test implementing a Bootstrap Resampling scheme. Moreover, we show via ns-3 simulations that using statistical models yields latency and throughput comparable to real data, which also demonstrates the accuracy of the models.
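中文: 自动驾驶车辆需借助V2X通信共享传感器数据,而数据压缩可减小消息大小、缓解链路拥塞;本文提出原始与压缩LiDAR点云大小的统计模型,经Kolmogorov-Smirnov检验验证,并在ns-3仿真中取得与真实数据相当的时延和吞吐量。
English: Autonomous vehicles share sensor data over V2X links, where compression reduces message size and link congestion; this paper presents statistical models for the size of raw and compressed LiDAR point clouds, validated with a bootstrap Kolmogorov-Smirnov test and shown in ns-3 simulations to yield latency and throughput comparable to real data.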
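
The validation step named in the abstract, a Kolmogorov-Smirnov test combined with bootstrap resampling, can be sketched roughly as follows. This is one common pooled-resampling variant and is an assumption, since the abstract does not specify the exact scheme. The function names and the toy point-cloud sizes are hypothetical.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def bootstrap_ks_pvalue(real, synthetic, n_boot=200, seed=0):
    """Estimate a p-value by resampling both groups from the pooled data.

    Assumed scheme: under the null hypothesis both samples share one
    distribution, so pooled resamples approximate the null KS distribution.
    """
    rng = random.Random(seed)
    observed = ks_statistic(real, synthetic)
    pooled = list(real) + list(synthetic)
    exceed = 0
    for _ in range(n_boot):
        resample_a = rng.choices(pooled, k=len(real))
        resample_b = rng.choices(pooled, k=len(synthetic))
        if ks_statistic(resample_a, resample_b) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (exceed + 1) / (n_boot + 1)

# Toy "point-cloud sizes": measured values vs. draws from a fitted model.
real_sizes = [random.Random(1).gauss(100.0, 10.0) for _ in range(50)]
model_sizes = [random.Random(2).gauss(100.0, 10.0) for _ in range(50)]
print(bootstrap_ks_pvalue(real_sizes, model_sizes))
```

A large p-value means the test finds no evidence that the model-generated sizes differ in distribution from the measured ones.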

Authors:Haoxuan Li, Yi Bin, Yunshan Ma, Guoqing Wang, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Title: SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs
Abstract:
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
中文: 本文提出SemCORE框架,通过结构化自然语言标识符和生成式语义验证增强跨模态检索的语义理解能力,在多项基准测试中显著优于现有生成式检索方法。
English: This paper introduces SemCORE, a novel generative cross-modal retrieval framework that enhances semantic understanding through structured natural language identifiers and generative semantic verification, outperforming state-of-the-art generative cross-modal retrieval methods.
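
The Recall@1 figure quoted in the abstract is the fraction of queries whose top-ranked prediction is a correct target. A minimal sketch of the metric follows; the function name and the toy identifiers are assumptions for illustration, not SemCORE's evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries with at least one relevant target in the top k.

    ranked_ids: one ranked list of predicted identifiers per query.
    relevant_ids: one set of ground-truth identifiers per query.
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)

# Three toy text-to-image queries with hypothetical image identifiers.
predictions = [["img3", "img1"], ["img2", "img5"], ["img9", "img4"]]
ground_truth = [{"img3"}, {"img5"}, {"img4"}]

print(recall_at_k(predictions, ground_truth, k=1))  # only the first query hits at rank 1
print(recall_at_k(predictions, ground_truth, k=2))  # every query hits within the top 2
```

Multiplying the result by 100 gives the percentage scale on which an improvement in "points" is normally reported.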

Authors:Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, Ping Luo
Title: RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
Abstract:
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
中文: RoboTwin提出了一种生成式数字孪生框架,为双臂机器人任务创建多样化专家数据集和现实世界对齐基准,通过模拟预训练使单臂和双臂任务成功率分别提升超过70%和40%。
English: RoboTwin introduces a generative digital twin framework that creates diverse expert datasets and real-world-aligned benchmarks for dual-arm robotic tasks, achieving over 70% and 40% success rate improvements in single-arm and dual-arm tasks respectively through simulated pre-training.

Authors:Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang
Title: MAIN: Mutual Alignment Is Necessary for instruction tuning
Abstract:
Instruction tuning has empowered large language models (LLMs) to achieve remarkable performance, yet its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. To meet this demand, various methods have been developed to synthesize data at scale. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that the quality of instruction-response pairs is determined not by the individual quality of each component, but by the degree of mutual alignment. To address this, we propose a Mutual Alignment Framework (MAIN) which enforces coherence between instructions and responses through mutual constraints. We demonstrate that MAIN generalizes well across model architectures and sizes, achieving state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse benchmarks. This work underscores the critical role of instruction-response alignment in enabling generalizable and high-quality instruction tuning for LLMs. All code is available from our repository.
中文:提出的互对齐框架(MAIN)通过确保指令与响应之间的连贯性来增强指令调优,在多种模型和基准测试中实现了卓越性能。
English: The proposed Mutual Alignment Framework (MAIN) enhances instruction tuning by ensuring coherence between instructions and responses, achieving superior performance across various models and benchmarks.

Authors:Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li
Title: TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Abstract:
Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectories across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net. The resulting agents show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10\% on multiple benchmarks, which shows the effectiveness of the GUI-Net dataset and underscores the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.
中文:TongUI框架通过利用多模态网络教程构建GUI-Net数据集,开发通用图形用户界面代理,使优化后的模型在基础定位和导航基准测试中实现显著性能提升。
English: The TongUI framework develops generalized GUI agents by leveraging multimodal web tutorials to create the GUI-Net dataset, which enables fine-tuned models to achieve significant performance improvements on grounding and navigation benchmarks.

Authors:Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates
Title: Plain Transformers Can be Powerful Graph Learners
Abstract:
Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications to the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated competitors such as subgraph GNNs and higher-order GNNs. Its outstanding empirical performance across various graph datasets also justifies the practical effectiveness of PPGT.
中文: 本研究通过引入简化的L2注意力、自适应均方根归一化和基于MLP的位置编码这三个简单修改,构建了强大的普通图变换器(PPGT),证明了普通Transformer架构可作为高效的图学习器,并在理论和实证层面均展现出卓越的图数据性能。
English: This work demonstrates that a plain Transformer can be a powerful graph learner by incorporating three simple modifications—simplified L2 attention, adaptive root-mean-square normalization, and an MLP-based stem for positional encoding—resulting in the Powerful Plain Graph Transformer (PPGT), which achieves strong theoretical and empirical performance across graph datasets.

Authors:Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun
Title: Generalized Visual Relation Detection with Diffusion Models
Abstract:
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
中文: 本文提出Diff-VRD模型,通过扩散方法将视觉关系建模为连续嵌入,突破预定义类别限制,在人物-物体交互和场景图生成任务中实现了更广义的视觉关系检测。
English: This paper introduces Diff-VRD, a diffusion-based model that addresses the semantic ambiguity in visual relation detection by generating continuous relation embeddings beyond pre-defined categories, enhancing generalization in tasks like human-object interaction and scene graph generation.

Authors:Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu
Title: ADT: Tuning Diffusion Models with Adversarial Supervision
Abstract:
Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and cumulative errors. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by simulating the inference process during optimization and aligning the final outputs with training data through adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.
中文: 提出的对抗扩散调优(ADT)框架通过在优化过程中引入对抗监督,有效弥合了扩散模型训练与推理之间的差距,显著提升了多个Stable Diffusion版本的分布对齐效果和图像质量。
English: The proposed Adversarial Diffusion Tuning (ADT) framework bridges the training-inference gap in diffusion models by incorporating adversarial supervision during optimization, significantly enhancing distribution alignment and image quality across multiple Stable Diffusion versions.

Authors:Qiaosi Wang, Xuhui Zhou, Maarten Sap, Jodi Forlizzi, Hong Shen
Title: Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective
Abstract:
The last couple of years have witnessed emerging research that appropriates Theory-of-Mind (ToM) tasks designed for humans to benchmark LLMs' ToM capabilities as an indication of LLMs' social intelligence. However, this approach has a number of limitations. Drawing on existing psychology and AI literature, we summarize the theoretical, methodological, and evaluation limitations, pointing out that certain issues are inherently present in the original ToM tasks used to evaluate humans' ToM, and that these issues persist and are exacerbated when the tasks are appropriated to benchmark LLMs' ToM. Taking a human-computer interaction (HCI) perspective, we argue that these limitations call for rethinking the definition and criteria of ToM in ToM benchmarks through a more dynamic, interactional approach that accounts for user preferences, needs, and experiences with LLMs in such evaluations. We conclude by outlining potential opportunities and challenges in this direction.
中文: 近期研究采用人类心智理论任务评估大语言模型社交智能存在诸多局限,呼吁建立更动态、交互式的评估标准,充分考虑用户需求与体验。
English: Recent research using human-designed Theory-of-Mind tasks to evaluate LLMs' social intelligence faces significant limitations, prompting a need for more dynamic, interaction-focused benchmarks that consider user experiences.

Authors:Jinhao Li, Zijian Chen, Runze Jiang, Tingzhu Chen, Changbo Wang, Guangtao Zhai
Title: Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark
Abstract:
Oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.
中文: 本研究提出了结构对齐的甲骨文数据集Oracle-P15K和基于扩散模型的生成器OBIDiff,能够将真实拓印风格迁移至字形图像,同时有效保持文字结构特征。
English: This study introduces Oracle-P15K, a structure-aligned oracle bone inscription dataset, and OBIDiff, a diffusion model that generates realistic inscriptions by transferring rubbing styles to glyph images while preserving structural accuracy.

Authors:Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum
Title: DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models
Abstract:
Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which retains the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on the 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
中文: 提出的权重分解低秩量化感知训练(DL-QAT)方法通过仅训练不到1%的参数,显著提升了量化效率,在LLaMA和LLaMA2等大语言模型的下游任务中表现出卓越性能。
English: The proposed Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT) method enhances quantization efficiency by training less than 1% of parameters, achieving superior performance in downstream tasks across various LLMs like LLaMA and LLaMA2.
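
To make the notion of a "quantization group" with its own scale concrete, the sketch below applies plain symmetric uniform quantization per group and returns the dequantized weights. It is a generic illustration only: the learned group magnitudes and LoRA updates of DL-QAT are omitted, and the function names, group size, and max-abs scaling rule are assumptions.

```python
def fake_quantize_group(weights, bits=3):
    """Symmetric uniform quantization of one group, returned in float form."""
    qmax = 2 ** (bits - 1) - 1            # 3 for signed 3-bit values
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)              # nothing to scale in an all-zero group
    scale = max_abs / qmax                # per-group scale (assumed max-abs rule)
    quantized = []
    for w in weights:
        level = round(w / scale)                  # nearest integer level
        level = max(-qmax - 1, min(qmax, level))  # clamp to the signed range
        quantized.append(level * scale)           # dequantize for inspection
    return quantized

def fake_quantize(weights, group_size=4, bits=3):
    """Split a flat weight list into groups and quantize each independently."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize_group(weights[start:start + group_size], bits))
    return out

# Two groups with very different ranges each receive their own scale.
row = [0.0, 0.3, -0.3, 0.14, 6.0, -2.0, 1.0, 0.0]
print(fake_quantize(row, group_size=4, bits=3))
```

Because each group is scaled separately, the small-valued first group is not crushed to zero by the large values in the second group, which is the usual motivation for finer quantization granularity.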

Authors:Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Abstract:
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
中文: ScreenSpot-Pro是一个针对专业GUI环境评估多模态大语言模型的新基准,现有模型在此表现不佳,而提出的ScreenSeekeR方法无需额外训练即可显著提升性能。
English: ScreenSpot-Pro is a new benchmark for evaluating Multi-modal Large Language Models (MLLMs) in professional GUI environments, where existing models struggle, and the proposed ScreenSeekeR method significantly improves performance without additional training.

Authors:Xiaohao Liu, Teng Tu, Yunshan Ma, Tat-Seng Chua
Title: Extending Visual Dynamics for Video-to-Music Generation
Abstract:
Music profoundly enhances video production by improving quality, engagement, and emotional resonance, sparking growing interest in video-to-music generation. Despite recent advances, existing approaches remain limited to specific scenarios or undervalue the visual dynamics. To address these limitations, we focus on tackling the complexity of dynamics and resolving temporal misalignment between video and music representations. To this end, we propose DyViM, a novel framework to enhance dynamics modeling for video-to-music generation. Specifically, we extract frame-wise dynamics features via a simplified motion encoder inherited from optical flow methods, followed by a self-attention module for aggregation within frames. These dynamic features are then incorporated to extend existing music tokens for temporal alignment. Additionally, high-level semantics are conveyed through a cross-attention mechanism, and an annealing tuning strategy helps fine-tune well-trained music decoders efficiently, thereby facilitating seamless adaptation. Extensive experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
中文:DyViM框架通过运动编码和注意力机制增强动态特征建模与时间对齐,在视频到音乐的生成任务中展现出优于现有方法的性能。
English: The proposed DyViM framework improves video-to-music generation by enhancing dynamic feature modeling and temporal alignment through motion encoding and attention mechanisms, demonstrating superior performance over existing methods.

Authors:Matteo Nerini, Bruno Clerckx
Title: Analog Computing for Signal Processing and Communications -- Part II: Toward Gigantic MIMO Beamforming
Abstract:
Analog-domain operations offer a promising solution to accelerating signal processing and enabling future multiple-input multiple-output (MIMO) communications with thousands of antennas. In Part I of this paper, we have introduced a microwave linear analog computer (MiLAC) as an analog computer that processes microwave signals linearly, demonstrating its potential to reduce the computational complexity of specific signal processing tasks. In Part II of this paper, we extend these benefits to wireless communications, showcasing how MiLAC enables gigantic MIMO beamforming entirely in the analog domain. MiLAC-aided beamforming enables the maximum flexibility and performance of digital beamforming, while significantly reducing hardware costs by minimizing the number of radio-frequency (RF) chains and only relying on low-resolution analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). In addition, it eliminates per-symbol operations by completely avoiding digital-domain processing and remarkably reduces the computational complexity of zero-forcing (ZF), which scales quadratically with the number of antennas instead of cubically. It also processes signals with fixed matrices, e.g., the discrete Fourier transform (DFT), directly in the analog domain. Numerical results show that it can perform ZF and DFT with a computational complexity reduction of up to $1.5\times 10^4$ and $4.0\times 10^7$ times, respectively, compared to digital beamforming.
中文: MiLAC系统在纯模拟域实现高效的大规模MIMO波束成形,在保持数字波束成形灵活性的同时,通过大幅减少射频链路数量和使用低分辨率转换器,将计算复杂度降低了数个数量级。
English: The MiLAC system enables efficient gigantic MIMO beamforming entirely in the analog domain, achieving digital-like flexibility while drastically reducing hardware costs and computational complexity by orders of magnitude.

Authors:Matteo Nerini, Bruno Clerckx
Title: Analog Computing for Signal Processing and Communications -- Part I: Computing with Microwave Networks
Abstract:
Analog computing has been recently revived due to its potential for energy-efficient and highly parallel computations. In this two-part paper, we explore analog computers that linearly process microwave signals, named microwave linear analog computers (MiLACs), and their applications in signal processing and communications. In Part I of this paper, we model a MiLAC as a multiport microwave network with tunable impedance components, enabling the execution of mathematical operations by reconfiguring the microwave network and applying input signals at its ports. We demonstrate that a MiLAC can efficiently compute the linear minimum mean square error (LMMSE) estimator and matrix inversion, with remarkably low computational complexity. Specifically, a matrix can be inverted with complexity growing with the square of its size. We also show how a MiLAC can be used jointly with digital operations to implement sophisticated algorithms such as the Kalman filter. To enhance practicability, we propose a design of MiLAC based on lossless impedance components, reducing power consumption and eliminating the need for costly active components. In Part II of this paper, we investigate the applications of MiLACs in wireless communications, highlighting their potential to enable future wireless systems by executing computations and beamforming in the analog domain.
中文摘要:本两篇论文提出微波线性模拟计算机(MiLAC)作为高效节能的微波信号模拟处理器,通过可调微波网络实现矩阵求逆等运算,并探讨其在无线通信系统中的实际应用与波束成形潜力。
English Summary: This two-part paper introduces microwave linear analog computers (MiLACs) as energy-efficient analog processors for microwave signals, demonstrating their capabilities in matrix inversion and signal processing with low complexity while proposing practical designs and wireless communication applications.

Authors:Xinyi Wang, Taekyung Kim, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou
Title: Safe Navigation in Uncertain Crowded Environments Using Risk Adaptive CVaR Barrier Functions
Abstract:
Robot navigation in dynamic, crowded environments poses a significant challenge due to the inherent uncertainties in the obstacle model. In this work, we propose a risk-adaptive approach based on the Conditional Value-at-Risk Barrier Function (CVaR-BF), where the risk level is automatically adjusted to accept the minimum necessary risk, achieving good performance in terms of safety and optimization feasibility under uncertainty. Additionally, we introduce a dynamic zone-based barrier function that characterizes the collision likelihood by evaluating the relative state between the robot and the obstacle. By integrating risk adaptation with this new function, our approach adaptively expands the safety margin, enabling the robot to proactively avoid obstacles in highly dynamic environments. Comparisons and ablation studies demonstrate that our method outperforms existing social navigation approaches and validate the effectiveness of our proposed framework.
中文: 本文提出了一种基于条件风险价值屏障函数和动态区域屏障函数的自适应风险机器人导航方法,通过自动调整风险水平和安全边界,在动态拥挤环境中实现了比现有方法更安全高效的避障效果。
English: This paper introduces a risk-adaptive robot navigation method using Conditional Value-at-Risk Barrier Functions and dynamic zone-based barrier functions to automatically adjust risk levels and safety margins, enabling safer and more efficient obstacle avoidance in dynamic crowded environments compared to existing approaches.
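
The abstract above relies on Conditional Value-at-Risk (CVaR) as its risk measure. As a rough illustration only, the following sketch estimates CVaR from sampled losses as the mean of the worst (1 - alpha) fraction of samples. The function name `empirical_cvar`, the sample-based estimator, and the example losses are assumptions made for illustration; this is not the paper's barrier-function formulation or its risk-adaptation rule.

```python
import math


def empirical_cvar(losses, alpha):
    """Sample-based CVaR: mean of the worst (1 - alpha) fraction of losses.

    Larger loss means a worse outcome. alpha = 0 gives the plain mean,
    and alpha close to 1 focuses on the few worst samples.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)  # worst outcomes first
    # Round before ceil so floating-point noise (e.g. 3.0000000000000004)
    # does not add an extra sample to the tail.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical loss samples, e.g. from simulated obstacle motion.
sample_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
risk_neutral = empirical_cvar(sample_losses, 0.0)  # mean of all samples
risk_averse = empirical_cvar(sample_losses, 0.8)   # mean of the worst 20%
```

Raising `alpha` makes the estimate more conservative, which is the knob a risk-adaptive scheme would tune; how the paper selects that level is not shown here.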

Authors:Zeynab Kaseb, Matthias Moller, Peter Palensky, Pedro P. Vergara
Title: Solving Power System Problems using Adiabatic Quantum Computing
Abstract:
This paper proposes a novel combinatorial optimization framework that reformulates existing power system problems into a format executable on quantum annealers. The proposed framework accommodates both normal and complex numbers and enables efficient handling of large-scale problems, thus ensuring broad applicability across power system problems. As a proof of concept, we demonstrate its applicability in two classical problems: (i) power system parameter identification, where we estimate the admittance matrix given voltage and current measurements, and (ii) power flow analysis, where we reformulate the nonlinear equations governing active and reactive power balance. The results show that the proposed framework effectively and efficiently solves both linear and nonlinear power system problems, and thus offers significant advantages in scenarios where traditional solvers face challenges, such as ill-conditioned systems and fault conditions.
中文摘要:本文提出了一种新颖的组合优化框架,将电力系统问题重构为可在量子退火器上执行的格式,有效解决了参数辨识和潮流分析等线性和非线性问题。
English Summary: This paper introduces a novel combinatorial optimization framework that reformulates power system problems for quantum annealers, demonstrating effective solutions for both linear and nonlinear issues like parameter identification and power flow analysis.

Authors:Tianyi Jiang, Zeyu Wang, Shanqing Yu, Qi Xuan
Title: Adaptive Substructure-Aware Expert Model for Molecular Property Prediction
Abstract:
Molecular property prediction is essential for applications such as drug discovery and toxicity assessment. While Graph Neural Networks (GNNs) have shown promising results by modeling molecules as molecular graphs, their reliance on data-driven learning limits their ability to generalize, particularly in the presence of data imbalance and diverse molecular substructures. Existing methods often overlook the varying contributions of different substructures to molecular properties, treating them uniformly. To address these challenges, we propose ASE-Mol, a novel GNN-based framework that leverages a Mixture-of-Experts (MoE) approach for molecular property prediction. ASE-Mol incorporates BRICS decomposition and significant substructure awareness to dynamically identify positive and negative substructures. By integrating a MoE architecture, it reduces the adverse impact of negative motifs while improving adaptability to positive motifs. Experimental results on eight benchmark datasets demonstrate that ASE-Mol achieves state-of-the-art performance, with significant improvements in both accuracy and interpretability.
中文: ASE-Mol提出了一种基于专家混合的图神经网络框架,能动态识别分子子结构,在分子性质预测中实现了最优性能。
English: ASE-Mol introduces a GNN framework using Mixture-of-Experts to dynamically identify molecular substructures, achieving state-of-the-art performance in property prediction.

Authors:Kleanthis Malialis, Stylianos Filippou, Christos G. Panayiotou, Marios M. Polycarpou
Title: SiameseDuo++: Active Learning from Data Streams with Dual Augmented Siamese Networks
Abstract:
Data stream mining, also known as stream learning, is a growing area which deals with learning from high-speed arriving data. Its relevance has surged recently due to its wide range of applicability, such as critical infrastructure monitoring, social media analysis, and recommender systems. The design of stream learning methods faces significant research challenges, from the nonstationary nature of the data (referred to as concept drift) and the fact that data streams are typically not annotated with the ground truth, to the requirement that such methods should process large amounts of data in real-time with limited memory. This work proposes the SiameseDuo++ method, which uses active learning to automatically select instances for a human expert to label according to a budget. Specifically, it incrementally trains two siamese neural networks which operate in synergy, augmented by generated examples. Both the proposed active learning strategy and augmentation operate in the latent space. SiameseDuo++ addresses the aforementioned challenges by operating with limited memory and limited labelling budget. Simulation experiments show that the proposed method outperforms strong baselines and state-of-the-art methods in terms of learning speed and/or performance. To promote open science we publicly release our code and datasets.
中文: SiameseDuo++是一种主动学习方法,通过协同训练双孪生神经网络并利用潜在空间增强,在有限内存和标注预算下解决数据流挖掘中的概念漂移和标注稀缺问题,实验表明其性能优于现有先进方法。
English: SiameseDuo++ is an active learning method that trains dual neural networks with limited memory and labeling budget to address concept drift and annotation scarcity in data stream mining, demonstrating superior performance in simulations.

Authors:Yapeng Mi, Zhi Gao, Xiaojian Ma, Qing Li
Title: Building LLM Agents by Incorporating Insights from Computer Systems
Abstract:
LLM-driven autonomous agents have emerged as a promising direction in recent years. However, many of these LLM agents are designed empirically or based on intuition, often lacking systematic design principles, which results in diverse agent structures with limited generality and scalability. In this paper, we advocate for building LLM agents by incorporating insights from computer systems. Inspired by the von Neumann architecture, we propose a structured framework for LLM agentic systems, emphasizing modular design and universal principles. Specifically, this paper first provides a comprehensive review of LLM agents from the computer system perspective, then identifies key challenges and future directions inspired by computer system design, and finally explores the learning mechanisms for LLM agents beyond the computer system. The insights gained from this comparative analysis offer a foundation for systematic LLM agent design and advancement.
中文: 本文借鉴计算机系统(尤其是冯·诺依曼架构)提出了一种结构化的LLM驱动自主智能体框架,旨在解决当前设计缺乏系统性原则的问题,并提升通用性和可扩展性。
English: This paper proposes a structured framework for LLM-driven autonomous agents inspired by computer systems, particularly the von Neumann architecture, to address the lack of systematic design principles and enhance generality and scalability.

Authors:Jiaxun Zhang, Yanchen Guan, Chengyue Wang, Haicheng Liao, Guohui Zhang, Zhenning Li
Title: LATTE: Lightweight Attention-based Traffic Accident Anticipation Engine
Abstract:
Accurately predicting traffic accidents in real-time is a critical challenge in autonomous driving, particularly in resource-constrained environments. Existing solutions often suffer from high computational overhead or fail to adequately address the uncertainty of evolving traffic scenarios. This paper introduces LATTE, a Lightweight Attention-based Traffic Accident Anticipation Engine, which integrates computational efficiency with state-of-the-art performance. LATTE employs Efficient Multiscale Spatial Aggregation (EMSA) to capture spatial features across scales, Memory Attention Aggregation (MAA) to enhance temporal modeling, and Auxiliary Self-Attention Aggregation (AAA) to extract latent dependencies over extended sequences. Additionally, LATTE incorporates the Flamingo Alert-Assisted System (FAA), leveraging a vision-language model to provide real-time, cognitively accessible verbal hazard alerts, improving passenger situational awareness. Evaluations on benchmark datasets (DAD, CCD, A3D) demonstrate LATTE's superior predictive capabilities and computational efficiency. LATTE achieves a state-of-the-art 89.74% Average Precision (AP) on the DAD benchmark, with 5.4% higher mean Time-To-Accident (mTTA) than the second-best model, and maintains competitive mTTA at a Recall of 80% (TTA@R80) (4.04s) while demonstrating robust accident anticipation across diverse driving conditions. Its lightweight design delivers a 93.14% reduction in floating-point operations (FLOPs) and a 31.58% decrease in parameter count (Params), enabling real-time operation on resource-limited hardware without compromising performance. Ablation studies confirm the effectiveness of LATTE's architectural components, while visualizations and failure case analyses highlight its practical applicability and areas for enhancement.
中文: 本文提出LATTE,一种基于注意力的轻量级交通事故预测引擎,通过高效多尺度空间聚合、记忆注意力聚合和辅助自注意力聚合等创新模块,在保持计算效率的同时实现了顶尖性能,在基准测试中展现出卓越的预测能力和显著降低的计算资源需求。
English: This paper introduces LATTE, a lightweight attention-based traffic accident anticipation engine that combines computational efficiency with state-of-the-art performance through innovative modules for spatial, temporal, and dependency modeling, achieving superior predictive accuracy and significant reductions in computational overhead on benchmark datasets.

Authors:Songtao Peng, Lei Wang, Wu Shuai, Hao Song, Jiajun Zhou, Shanqing Yu, Qi Xuan
Title: Hierarchical Local-Global Feature Learning for Few-shot Malicious Traffic Detection
Abstract:
With the rapid growth of internet traffic, malicious network attacks have become increasingly frequent and sophisticated, posing significant threats to global cybersecurity. Traditional detection methods, including rule-based and machine learning-based approaches, struggle to accurately identify emerging threats, particularly in scenarios with limited samples. While recent advances in few-shot learning have partially addressed the data scarcity issue, existing methods still exhibit high false positive rates and lack the capability to effectively capture crucial local traffic patterns. In this paper, we propose HLoG, a novel hierarchical few-shot malicious traffic detection framework that leverages both local and global features extracted from network sessions. HLoG employs a sliding-window approach to segment sessions into phases, capturing fine-grained local interaction patterns through hierarchical bidirectional GRU encoding, while simultaneously modeling global contextual dependencies. We further design a session similarity assessment module that integrates local similarity with global self-attention-enhanced representations, achieving accurate and robust few-shot traffic classification. Comprehensive experiments on three meticulously reconstructed datasets demonstrate that HLoG significantly outperforms existing state-of-the-art methods. Particularly, HLoG achieves superior recall rates while substantially reducing false positives, highlighting its effectiveness and practical value in real-world cybersecurity applications.
中文: 本文提出HLoG这一分层小样本恶意流量检测框架,通过融合局部与全局特征实现少量样本下的精准威胁识别,在显著降低误报率的同时提升召回率,性能优于现有最优方法。
English: This paper introduces HLoG, a hierarchical few-shot malicious traffic detection framework that integrates local and global features to accurately identify emerging cyber threats with minimal samples, significantly outperforming existing methods by reducing false positives and improving recall rates.
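
The sliding-window segmentation mentioned in the abstract can be pictured with a small sketch. It cuts a session, represented here as a plain list of packets, into overlapping phases. The function name `segment_session`, the `window` and `stride` parameters, and the choice to keep a final tail window are all hypothetical; the abstract does not state HLoG's actual window settings or its handling of short sessions.

```python
def segment_session(packets, window, stride):
    """Split one session into overlapping phases of `window` packets.

    Consecutive phases start `stride` packets apart. A session shorter
    than one window is returned as a single phase, and a final window is
    added when the regular starts would leave trailing packets uncovered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(packets) <= window:
        return [list(packets)]
    last_start = len(packets) - window
    phases = [list(packets[s:s + window]) for s in range(0, last_start + 1, stride)]
    if last_start % stride != 0:
        # Cover the tail that the regular stride skipped.
        phases.append(list(packets[-window:]))
    return phases


# Seven placeholder packets, windows of three, moving two packets at a time.
example_phases = segment_session(list(range(7)), window=3, stride=2)
```

Each phase could then be encoded separately for local features, while the full session is used for the global representation.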

Authors:Yiqi Zhao, Emily Zhu, Bardh Hoxha, Georgios Fainekos, Jyotirmoy V. Deshmukh, Lars Lindemann
Title: Distributionally Robust Predictive Runtime Verification under Spatio-Temporal Logic Specifications
Abstract:
Cyber-physical systems (CPS) designed in simulators, often consisting of multiple interacting agents (e.g. in multi-agent formations), behave differently in the real world. We want to verify these systems during runtime when they are deployed. We thus propose robust predictive runtime verification (RPRV) algorithms for: (1) general stochastic CPS under signal temporal logic (STL) tasks, and (2) stochastic multi-agent systems (MAS) under spatio-temporal logic tasks. The RPRV problem presents the following challenges: (1) there may not be sufficient data on the behavior of the deployed CPS, (2) predictive models based on design phase system trajectories may encounter distribution shift during real-world deployment, and (3) the algorithms need to scale to the complexity of MAS and be applicable to spatio-temporal logic tasks. To address the challenges, we assume knowledge of an upper bound on the statistical distance between the trajectory distributions of the system at deployment and design time. We are motivated by our prior work [1, 2] where we proposed an accurate and an interpretable RPRV algorithm for general CPS, which we here extend to the MAS setting and spatio-temporal logic tasks. Specifically, we use a learned predictive model to estimate the system behavior at runtime and robust conformal prediction to obtain probabilistic guarantees by accounting for distribution shifts. Building on [1], we perform robust conformal prediction over the robust semantics of spatio-temporal reach and escape logic (STREL) to obtain centralized RPRV algorithms for MAS. We empirically validate our results in a drone swarm simulator, where we show the scalability of our RPRV algorithms to MAS and analyze the impact of different trajectory predictors on the verification result. To the best of our knowledge, these are the first statistically valid algorithms for MAS under distribution shift.
中文: 本文提出了鲁棒预测性运行时验证(RPRV)算法,通过预测模型和保形预测处理信息物理系统和多智能体系统在现实部署中的性能差异与分布偏移,提供概率性验证保证。
English: The paper introduces robust predictive runtime verification (RPRV) algorithms to address real-world performance discrepancies and distribution shifts in cyber-physical systems and multi-agent systems, using predictive modeling and conformal prediction for probabilistic guarantees.
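
To make the conformal prediction step more concrete, the sketch below computes the standard split-conformal threshold from calibration scores. It covers only the plain, shift-free case: the robust variant described in the abstract additionally tightens the threshold using the assumed bound on the distribution shift, and that adjustment is not shown. The function name `conformal_quantile` and the example scores are assumptions for illustration.

```python
import math


def conformal_quantile(scores, delta):
    """Split-conformal threshold for miscoverage level `delta`.

    Returns the ceil((n + 1) * (1 - delta))-th smallest calibration score.
    If a new score is exchangeable with the calibration scores, it falls
    at or below this threshold with probability at least 1 - delta.
    """
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - delta))
    if rank > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return sorted(scores)[rank - 1]


# Hypothetical nonconformity scores, e.g. prediction errors on the
# robustness value of a specification over calibration trajectories.
calibration_scores = [5, 1, 4, 2, 3, 9, 7, 8, 6]
threshold = conformal_quantile(calibration_scores, 0.1)
```

One common way to use such a threshold in runtime verification is to subtract it from the predicted robustness and declare the specification satisfied, with the stated confidence, when the result stays positive.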

Authors:Renwu Li, Wenjing Ke, Dong Li, Lu Tian, Emad Barsoum
Title: MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM
Abstract:
We present MonoGS++, a novel fast and accurate Simultaneous Localization and Mapping (SLAM) method that leverages 3D Gaussian representations and operates solely on RGB inputs. While previous 3D Gaussian Splatting (GS)-based methods largely depended on depth sensors, our approach reduces the hardware dependency and only requires RGB input, leveraging online visual odometry (VO) to generate sparse point clouds in real-time. To reduce redundancy and enhance the quality of 3D scene reconstruction, we implemented a series of methodological enhancements in 3D Gaussian mapping. Firstly, we introduced dynamic 3D Gaussian insertion to avoid adding redundant Gaussians in previously well-reconstructed areas. Secondly, we introduced a clarity-enhancing Gaussian densification module and planar regularization to handle texture-less areas and flat surfaces better. We achieved precise camera tracking results on both the synthetic Replica and real-world TUM-RGBD datasets, comparable to those of the state-of-the-art. Additionally, our method realized a significant 5.57x improvement in frames per second (fps) over the previous state-of-the-art, MonoGS.
中文: MonoGS++ 是一种仅使用RGB输入的快速精准SLAM方法,采用动态高斯插入和清晰度增强模块的3D高斯表示,在相机追踪精度上达到顶尖水平,并比MonoGS实现了5.57倍的帧率提升。
English: MonoGS++ is a fast and accurate RGB-only SLAM method that uses 3D Gaussian representations with dynamic insertion and clarity enhancement modules, achieving state-of-the-art camera tracking and a 5.57x speed improvement over MonoGS.

Authors:Hao Yin, Shi Guo, Xu Jia, Xudong XU, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue
Title: EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling
Abstract:
When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes, which can be used for recovering the sound. Early studies always encounter trade-offs related to sampling rate, bandwidth, field of view, and the simplicity of the optical path. Recent advances in event camera hardware show good potential for its application in visual sound recovery, because of its superior ability in capturing high-frequency signals. However, existing event-based vibration recovery methods are still sub-optimal for sound recovery. In this work, we propose a novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal information from the event stream. We first generate a large training set using a novel simulation pipeline. Then we design a network that leverages the sparsity of events to capture spatial information and uses Mamba to model long-term temporal information. Lastly, we train a spatial aggregation block to aggregate information from different locations to further improve signal quality. To capture event signals caused by sound waves, we also design an imaging system using a laser matrix to enhance the gradient and collect multiple data sequences for testing. Experimental results on synthetic and real-world data demonstrate the effectiveness of our method.
中文摘要:本研究提出了一种新型非接触式声音恢复方案,利用事件相机捕捉高频视觉振动,通过模拟训练流程和结合Mamba时序建模的专用网络,成功实现了从视觉数据中重建声音信号。
English Summary: This study introduces a novel non-contact sound recovery pipeline that utilizes event cameras to capture high-frequency visual vibrations, employing a simulation-based training approach and a specialized network with Mamba for temporal modeling to effectively reconstruct sound from visual data.

Authors:Zhiling Zhu, Tieming Chen, Chengwei Liu, Han Liu, Qijie Song, Zhengzi Xu, Yang Liu
Title: Doctor: Optimizing Container Rebuild Efficiency by Instruction Re-Orchestration
Abstract:
Containerization has revolutionized software deployment, with Docker leading the way due to its ease of use and consistent runtime environment. As Docker usage grows, optimizing Dockerfile performance, particularly by reducing rebuild time, has become essential for maintaining efficient CI/CD pipelines. However, existing optimization approaches primarily address single builds without considering the recurring rebuild costs associated with modifications and evolution, limiting long-term efficiency gains. To bridge this gap, we present Doctor, a method for improving Dockerfile build efficiency through instruction re-ordering that addresses key challenges: identifying instruction dependencies, predicting future modifications, ensuring behavioral equivalence, and managing the optimization computational complexity. We developed a comprehensive dependency taxonomy based on Dockerfile syntax and a historical modification analysis to prioritize frequently modified instructions. Using a weighted topological sorting algorithm, Doctor optimizes instruction order to minimize future rebuild time while maintaining functionality. Experiments on 2,000 GitHub repositories show that Doctor improves 92.75% of Dockerfiles, reducing rebuild time by an average of 26.5%, with 12.82% of files achieving over a 50% reduction. Notably, 86.2% of cases preserve functional similarity. These findings highlight best practices for Dockerfile management, enabling developers to enhance Docker efficiency through informed optimization strategies.
中文: Doctor提出了一种Dockerfile优化方法,通过指令重排序结合依赖分析和修改预测,在保持功能等效的同时平均减少26.5%的重建时间,显著提升了容器构建效率。
English: Doctor introduces a Dockerfile optimization method that reorders instructions based on dependency analysis and modification prediction, achieving an average 26.5% rebuild time reduction while maintaining functional equivalence in most cases.
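
The idea of reordering instructions with a weighted topological sort can be sketched as follows. This is a minimal illustration, not Doctor's actual algorithm: it uses Kahn's algorithm with a priority queue that always emits the ready instruction with the lowest estimated modification frequency, so rarely changed layers land early and stay cached. The function name `reorder_instructions`, the dependency map, and the frequency values are all hypothetical.

```python
import heapq


def reorder_instructions(instructions, must_follow, change_freq):
    """Order instructions so dependencies hold and stable ones come first.

    instructions: list of instruction strings in their original order.
    must_follow:  dict mapping an index to the set of indices that must
                  appear before it.
    change_freq:  estimated modification frequency per instruction.
    """
    n = len(instructions)
    indegree = [0] * n
    dependents = [[] for _ in range(n)]
    for node, prerequisites in must_follow.items():
        for pre in prerequisites:
            dependents[pre].append(node)
            indegree[node] += 1
    # Ready instructions, cheapest (least frequently modified) first;
    # ties fall back to the original position.
    ready = [(change_freq[i], i) for i in range(n) if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, current = heapq.heappop(ready)
        order.append(current)
        for nxt in dependents[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (change_freq[nxt], nxt))
    if len(order) != n:
        raise ValueError("dependency cycle detected")
    return [instructions[i] for i in order]


# Hypothetical Dockerfile: the source copy changes often, the system
# package install almost never.
dockerfile = [
    "FROM python:3.12-slim",
    "COPY src/ /app/src/",
    "RUN apt-get update && apt-get install -y curl",
    "COPY requirements.txt /app/",
    "RUN pip install -r /app/requirements.txt",
]
dependencies = {1: {0}, 2: {0}, 3: {0}, 4: {3}}
frequencies = [0.0, 0.9, 0.05, 0.2, 0.2]
reordered = reorder_instructions(dockerfile, dependencies, frequencies)
```

In this example the frequently edited `COPY src/` step moves to the end, so a source change no longer invalidates the cached package installation layers.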

Authors:Fubao Zhu, Yang Zhang, Gengmin Liang, Jiaofen Nan, Yanting Li, Chuang Han, Danyang Sun, Zhiguo Wang, Chen Zhao, Wenxuan Zhou, Jian He, Yi Xu, Iokfai Cheang, Xu Zhu, Yanli Zhou, Weihua Zhou
Title: Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network
Abstract:
Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved an area under the receiver operating characteristic curve (AUC) of 0.81 ± 0.06 (standard deviation) and an accuracy (ACC) of 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). It has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
中文: 本研究开发并验证了一种结合图卷积网络、卷积神经网络和Transformer的深度学习模型,通过多模态数据对肺动脉高压进行分类,测试集AUC达0.81,具备辅助临床决策的潜力。
English: This study developed and validated a deep learning model combining GCN, CNN, and Transformers to classify pulmonary hypertension using multimodal data, achieving an AUC of 0.81 and demonstrating potential for clinical decision support.

Authors:Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
Title: Multi-Token Attention
Abstract:
Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
中文: 提出的多令牌注意力(MTA)方法通过卷积操作整合多个查询和键向量,使LLM能基于更丰富信息定位上下文,在多项基准测试中表现优异,尤其在长上下文任务中优势显著。
English: The proposed Multi-Token Attention (MTA) method enhances LLMs by using convolution to incorporate multiple query and key vectors for more nuanced context localization, achieving superior performance across benchmarks, especially in long-context tasks.
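
A toy version of letting nearby queries and keys influence each other's attention weights is sketched below. It is a simplified, single-head illustration under several assumptions: a small fixed kernel is convolved over the causal attention logits before the softmax, future positions are zeroed before the convolution and excluded afterwards, and the convolution over heads mentioned in the abstract is omitted. The function name, the kernel layout, and the masking scheme are hypothetical and do not reproduce the paper's exact formulation.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_token_attention_weights(queries, keys, kernel):
    """Causal attention weights after mixing neighbouring logits.

    kernel[a][b] weights the logit of the query `a` steps back and the
    key offset by `b - len(kernel[0]) // 2` from the current key.
    A kernel of [[1.0]] recovers ordinary causal softmax attention.
    """
    n = len(queries)
    scale = math.sqrt(len(queries[0]))
    # Scaled dot-product logits; future keys are zeroed so they cannot
    # leak into the convolution.
    logits = [[_dot(queries[i], keys[j]) / scale if j <= i else 0.0
               for j in range(n)] for i in range(n)]
    half = len(kernel[0]) // 2
    weights = []
    for i in range(n):
        mixed = []
        for j in range(i + 1):
            acc = 0.0
            for a, kernel_row in enumerate(kernel):
                for b, w in enumerate(kernel_row):
                    qi, kj = i - a, j + b - half
                    if qi >= 0 and 0 <= kj <= qi:
                        acc += w * logits[qi][kj]
            mixed.append(acc)
        # Softmax over visible keys only; future keys get zero weight.
        weights.append(_softmax(mixed) + [0.0] * (n - i - 1))
    return weights


toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plain = multi_token_attention_weights(toy_q, toy_k, [[1.0]])
mixed = multi_token_attention_weights(
    toy_q, toy_k, [[0.5, 1.0, 0.5], [0.0, 0.5, 0.0]])
```

With the identity kernel `[[1.0]]` the function reduces to standard single-token attention, which makes it easy to compare against the mixed variant.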

Authors:Wenbo Nie, Lang Nie, Chunyu Lin, Jingwen Chen, Ke Xing, Jiyuan Wang, Kang Liao
Title: Beyond Wide-Angle Images: Structure-to-Detail Video Portrait Correction via Unsupervised Spatiotemporal Adaptation
Abstract:
Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of the transformer and multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC), by spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and mitigates the potential temporal shakes sequentially in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with a large diversity in the number of people, lighting conditions, and background. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.
中文: 提出的ImagePC和VideoPC模型通过结合Transformer和扩散技术,有效校正广角图像和视频中的人脸畸变,实现了卓越的结构与细节修复效果,并具备更优的时间稳定性。
English: The proposed ImagePC and VideoPC models effectively correct facial distortions in wide-angle images and videos by combining transformer and diffusion techniques, achieving superior structural and detail restoration with enhanced temporal stability.

Authors:Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, Jun Wang
Title: MF-LLM: Simulating Population Decision Dynamics via a Mean-Field Large Language Model Framework
Abstract:
Simulating collective decision-making involves more than aggregating individual behaviors; it emerges from dynamic interactions among individuals. While large language models (LLMs) offer strong potential for social simulation, achieving quantitative alignment with real-world data remains a key challenge. To bridge this gap, we propose the Mean-Field LLM (MF-LLM) framework, the first to incorporate mean field theory into LLM-based social simulation. MF-LLM models bidirectional interactions between individuals and the population through an iterative process, generating population signals to guide individual decisions, which in turn update the signals. This interplay produces coherent trajectories of collective behavior. To improve alignment with real-world data, we introduce IB-Tune, a novel fine-tuning method inspired by the Information Bottleneck principle, which retains population signals most predictive of future actions while filtering redundant history. Evaluated on a real-world social dataset, MF-LLM reduces KL divergence to human population distributions by 47% compared to non-mean-field baselines, enabling accurate trend forecasting and effective intervention planning. Generalizing across 7 domains and 4 LLM backbones, MF-LLM provides a scalable, high-fidelity foundation for social simulation.
中文:MF-LLM框架首次将平均场理论与大语言模型结合,通过个体与群体的双向交互模拟集体决策过程,在真实社会数据集上相比基线模型将KL散度降低47%,并能在多个领域实现精准趋势预测。
English: The MF-LLM framework integrates mean field theory with large language models to simulate collective decision-making through bidirectional individual-population interactions, achieving a 47% reduction in KL divergence from real-world data while enabling accurate trend forecasting across multiple domains.
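The KL-divergence metric mentioned in this entry can be illustrated with a minimal sketch. The action categories and probabilities are made-up values for illustration, not data from the paper. The helper name `kl_divergence` is hypothetical, and the direction KL(human || simulated) is an assumption, since the abstract does not state it.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as dicts of probabilities."""
    total = 0.0
    for key, p_k in p.items():
        if p_k > 0:
            # Clamp q to eps so a missing category does not divide by zero.
            total += p_k * math.log(p_k / max(q.get(key, 0.0), eps))
    return total

# Hypothetical population-level action distributions.
human = {"support": 0.5, "oppose": 0.3, "neutral": 0.2}
simulated = {"support": 0.4, "oppose": 0.4, "neutral": 0.2}

print(kl_divergence(human, simulated))
```

A lower value means the simulated population distribution sits closer to the human one, and it is zero when the two coincide.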

Authors:Zhiting Fan, Ruizhe Chen, Zuozhu Liu
Title: BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models
Abstract:
Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, face limitations related to difficulties in understanding underlying intentions and the lack of criteria for fairness judgment. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to explicitly reason based on fairness specifications, while the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Our experiments, conducted across five datasets, demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.
中文: BiasGuard是一种新颖的两阶段偏见检测工具,通过显式推理和强化学习超越现有方法,在多个数据集上提高了准确性并减少了过度公平误判。
English: BiasGuard is a novel two-stage bias detection tool that uses explicit reasoning and reinforcement learning to surpass existing methods in accuracy and reduce over-fairness misjudgments across multiple datasets.

Authors:Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, Qing Wang
Title: Robust Multi-agent Communication Based on Decentralization-Oriented Adversarial Training
Abstract:
In typical multi-agent reinforcement learning (MARL) problems, communication is important for agents to share information and make the right decisions. However, due to the complexity of training multi-agent communication, existing methods often fall into the dilemma of local optimization, which leads to the concentration of communication in a limited number of channels and presents an unbalanced structure. Such unbalanced communication policies are vulnerable to abnormal conditions, where damage to critical communication channels can trigger the crash of the entire system. Inspired by decentralization theory in sociology, we propose DMAC, which enhances the robustness of multi-agent communication policies by retraining them into decentralized patterns. Specifically, we train an adversary DMAC\_Adv which can dynamically identify and mask the critical communication channels, and then apply the adversarial samples generated by DMAC\_Adv to the adversarial learning of the communication policy to force the policy to explore other potential communication schemes and transition to a decentralized structure. As a training method to improve robustness, DMAC can be fused with any learnable communication policy algorithm. The experimental results on two communication policies and four multi-agent tasks demonstrate that DMAC achieves greater improvement in the robustness and performance of communication policies compared with two state-of-the-art and commonly used baselines. The results also demonstrate that DMAC can achieve a decentralized communication structure with acceptable communication cost.
中文: DMAC通过对抗性学习将多智能体通信策略重新训练为去中心化模式,增强了其鲁棒性,有效应对关键通信渠道故障,提升了系统性能和稳定性。
English: DMAC enhances the robustness of multi-agent communication policies by retraining them into decentralized patterns using adversarial learning, improving performance and resilience against critical channel failures.
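The idea of masking critical communication channels can be sketched as follows. This is only an illustration: the paper trains an adversary (DMAC\_Adv) to pick the channels, whereas this sketch swaps in a plain top-k heuristic over hypothetical channel weights. The function name and data layout are assumptions.

```python
def mask_critical_channels(weights, k):
    """Zero out the k highest-weight channels.

    weights maps (sender, receiver) -> usage weight (hypothetical values).
    Returns the masked weights and the list of masked channels.
    """
    # Stand-in for the learned adversary: rank channels by weight.
    critical = sorted(weights, key=weights.get, reverse=True)[:k]
    masked = {ch: (0.0 if ch in critical else w) for ch, w in weights.items()}
    return masked, critical

weights = {(0, 1): 0.7, (1, 2): 0.2, (2, 0): 0.1}
masked, critical = mask_critical_channels(weights, k=1)
print(critical, masked)
```

Training a policy against such masked inputs would push it to rely less on any single channel, which is the decentralizing effect the abstract describes.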

Authors:Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng
Title: FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
Abstract:
Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules. To address these issues, we propose FiLA (Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
中文: FiLA-Video采用轻量级动态权重多帧融合策略和关键帧选择方法,结合高效训练数据生成,在长视频理解中实现了更优的效率和准确性。
English: FiLA-Video introduces a lightweight dynamic-weight multi-frame fusion strategy with keyframe selection and efficient training data generation, achieving superior efficiency and accuracy in long-video comprehension.
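A dynamic-weight multi-frame fusion step could look roughly like the sketch below, which merges several frame feature vectors into one using softmax weights. The per-frame scores are hypothetical inputs here. How FiLA-Video actually produces its weights is not described in the abstract, so this is a generic weighted-average illustration rather than the authors' module.

```python
import math

def fuse_frames(frames, scores):
    """Fuse frame feature vectors into one vector with softmax weights.

    frames: list of equal-length feature vectors (lists of floats).
    scores: one relevance score per frame (assumed to come from some scorer).
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frames[0])
    fused = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return fused, weights

fused, weights = fuse_frames([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
print(fused, weights)
```

With equal scores the fusion reduces to a plain average, and a higher score shifts the fused representation toward that frame.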

Authors:Yongxuan Han, Shengzhong Liu, Fan Wu, Guihai Chen
Title: ABO: Abandon Bayer Filter for Adaptive Edge Offloading in Responsive Augmented Reality
Abstract:
Bayer-patterned color filter array (CFA) has been the go-to solution for color image sensors. In augmented reality (AR), although color interpolation (i.e., demosaicing) of pre-demosaic RAW images facilitates user-friendly rendering, it creates no benefits in offloaded DNN analytics but increases the image channels by 3 times, inducing higher transmission overheads. The potential optimization in frame preprocessing of DNN offloading is yet to be investigated. To that end, we propose ABO, an adaptive RAW frame offloading framework that parallelizes demosaicing with DNN computation. Its contributions are three-fold: First, we design a configurable tile-wise RAW image neural codec to compress frame sizes while sustaining downstream DNN accuracy under bandwidth constraints. Second, based on content-aware tiles-in-frame selection and runtime bandwidth estimation, a dynamic transmission controller adaptively calibrates codec configurations to maximize the DNN accuracy. Third, we further optimize the system pipelining to achieve lower end-to-end frame processing latency and higher throughput. Through extensive evaluations on a prototype platform, ABO consistently achieves 40% more frame processing throughput and 30% less end-to-end latency while improving the DNN accuracy by up to 15% compared with SOTA baselines. It also exhibits improved robustness against dim lighting and motion blur situations.
Chinese: ABO是一种自适应RAW帧卸载框架,通过将去马赛克与DNN计算并行化,相比现有技术实现了40%的吞吐量提升、30%的延迟降低,并将DNN精度最高提升15%。
English: ABO is an adaptive RAW frame offloading framework that parallelizes demosaicing with DNN computation, enhancing throughput by 40%, reducing latency by 30%, and improving DNN accuracy by up to 15% compared to existing methods.
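The threefold channel increase from demosaicing can be checked with simple arithmetic. The sketch below compares uncompressed frame sizes for a single-channel RAW frame and its three-channel RGB counterpart. The 1920x1080 resolution and 8-bit sample depth are assumptions chosen for illustration, not figures from the paper.

```python
def frame_bytes(width, height, channels, bits_per_sample=8):
    """Uncompressed frame size in bytes."""
    return width * height * channels * bits_per_sample // 8

# Assumed resolution and bit depth, for illustration only.
raw = frame_bytes(1920, 1080, channels=1)   # Bayer RAW: one sample per pixel
rgb = frame_bytes(1920, 1080, channels=3)   # demosaiced: three samples per pixel

print(raw, rgb, rgb / raw)
```

At equal bit depth the demosaiced frame is three times larger, which is the transmission overhead that offloading RAW frames avoids.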

Authors:Junpeng Jiang, Gangyi Hong, Miao Zhang, Hengtong Hu, Kun Zhan, Rui Shao, Liqiang Nie
Title: DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer
Abstract:
Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driving scenarios. To address this gap, we propose DiVE, a diffusion transformer-based generative framework meticulously engineered to produce high-fidelity, temporally coherent, and cross-view consistent multi-view videos, aligning seamlessly with bird's-eye view layouts and textual descriptions. DiVE leverages a unified cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. Despite these advancements, synthesizing high-resolution videos under multimodal constraints introduces dual challenges: investigating the optimal classifier-free guidance (CFG) configuration under intricate multi-condition inputs and mitigating excessive computational latency in high-resolution rendering, both of which remain underexplored in prior research. To resolve these limitations, we introduce two innovations: Multi-Control Auxiliary Branch Distillation, which streamlines multi-condition CFG selection while circumventing high computational overhead, and Resolution Progressive Sampling, a training-free acceleration strategy that staggers resolution scaling to reduce the latency caused by high resolution. These innovations collectively achieve a 2.62x speedup with minimal quality degradation. Evaluated on the nuScenes dataset, DiVE achieves SOTA performance in multi-view video generation, yielding photorealistic outputs with exceptional temporal and cross-view coherence.
中文: DiVE是一种基于扩散变换器的生成框架,通过创新的注意力机制和两项技术突破——多控制辅助分支蒸馏与分辨率渐进采样,能够生成高质量、时空一致的多视角驾驶视频,在实现顶尖性能的同时大幅提升了处理速度。
English: DiVE is a diffusion transformer-based framework that generates high-quality, spatiotemporally consistent multi-view driving videos by leveraging novel attention mechanisms and introducing two innovations—Multi-Control Auxiliary Branch Distillation and Resolution Progressive Sampling—to achieve state-of-the-art performance with significantly accelerated processing.

Authors:Yu Li, Qizhi Pei, Mengyuan Sun, Honglin Lin, Chenlin Ming, Xin Gao, Jiang Wu, Conghui He, Lijun Wu
Title: CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
中文: 本文提出CipherBank基准测试,用于评估大语言模型在密码解密任务中的推理能力,揭示了当前模型在密码学推理方面存在显著不足,强调了提升该领域能力的必要性。
English: This paper introduces CipherBank, a comprehensive benchmark for evaluating LLMs' reasoning abilities in cryptographic decryption tasks, revealing significant performance gaps in current models and highlighting the need for improved cryptographic reasoning capabilities.
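To give a feel for what a classical decryption task involves, here is a minimal Caesar shift sketch. The abstract does not list CipherBank's nine algorithms, so the Caesar cipher is used only as a familiar example of a classical cipher. The function name and the sample plaintext are hypothetical.

```python
def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` positions, leaving other characters untouched."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

ciphertext = caesar_shift("Hello, World!", 3)   # encrypt with key 3
plaintext = caesar_shift(ciphertext, -3)        # decrypt by reversing the shift
print(ciphertext, plaintext)
```

A model solving such a task must infer the key from the ciphertext or from worked examples and then apply the inverse mapping character by character.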

Authors:Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Title: 4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis
Abstract:
Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with behavioral cognitive scores tabular data biomarkers. However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion. To bridge this gap, we propose M2M-AlignNet: a geometry-aware multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.
中文:M2M-AlignNet是一种几何感知的多模态协同注意力网络,通过对比学习和潜在对齐融合fMRI与sMRI数据,利用几何加权补丁对应关系减少表征差异并自主发现融合模式,从而提升早期阿尔茨海默病的诊断效果。
English: M2M-AlignNet is a geometry-aware multimodal co-attention network that aligns fMRI and sMRI data through contrastive learning and latent alignment to enhance early Alzheimer's disease diagnosis by reducing representational discrepancies and autonomously discovering fusion patterns.

Authors:Andrea Conti, Matteo Poggi, Valerio Cambareri, Martin R. Oswald, Stefano Mattoccia
Title: ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration
Abstract:
Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to use very sparse ToF input data effectively. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.
Chinese: ToF-Splatting是一种基于3D高斯泼溅的新型SLAM流程,通过融合多帧线索有效利用极度稀疏的飞行时间深度数据生成稠密深度图,在跟踪与建图方面实现了最先进的性能表现。
English: ToF-Splatting is a novel 3D Gaussian Splatting-based SLAM pipeline that effectively utilizes extremely sparse Time-of-Flight depth data by integrating multi-frame cues to produce dense depth maps, achieving state-of-the-art performance in tracking and mapping.

Authors:Yang Cao, Wenchi Cheng, Jingqing Wang, Wei Zhang
Title: Active Reconfigurable Intelligent Surface Assisted MIMO: Electromagnetic-Compliant Modeling with Mutual Coupling
Abstract:
Reconfigurable Intelligent Surfaces (RIS) represent a transformative technology for sixth-generation (6G) wireless communications, but they suffer from a significant limitation, namely the double-fading attenuation. Active RIS has emerged as a promising solution, effectively mitigating the attenuation issues associated with conventional RIS-assisted systems. However, the current academic work on active RIS focuses on system-level optimization, often overlooking the development of models that are compatible with its electromagnetic (EM) and physical properties. The challenge of constructing realistic, EM-compliant models for active RIS-assisted communication, as well as understanding their implications on system-level optimization, remains an open research area. To tackle these problems, in this paper we develop a novel EM-compliant model with mutual coupling (MC) for active RIS-assisted wireless systems by integrating the developed scattering-parameter ($S$-parameter) based active RIS framework with multiport network theory, which facilitates system-level analysis and optimization. To evaluate the performance of the EM-compliant active RIS model, we design a joint optimization scheme based on the transmit beamforming at the transmitter and the reflection coefficient at the active RIS to maximize the achievable rate of the EM-compliant active RIS-assisted MIMO system. To tackle the inherent non-convexity of this problem, we employ the Sherman-Morrison inversion and Neumann series (SMaN)-based alternating optimization (AO) algorithm. Simulation results verify that the EM property (i.e., the MC effect) is an indispensable factor in the optimization process of MIMO systems. Neglecting this effect introduces a substantial performance gap: the more pronounced the MC effect is, the greater the gap in achievable rates.
中文: 本文针对有源智能超表面辅助无线系统,开发了一种兼容电磁特性并包含互耦效应的新型模型,通过结合S参数框架与多端口网络理论,并设计联合优化算法来最大化MIMO系统可达速率,解决了现有模型因忽略电磁特性而导致的性能差距问题。
English: This paper introduces an electromagnetic-compliant model with mutual coupling for active RIS-assisted wireless systems, addressing performance gaps in current models by integrating S-parameter frameworks with multiport network theory and developing a joint optimization algorithm to maximize MIMO system rates.
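For readers unfamiliar with the two tools named in the SMaN algorithm, their standard textbook forms are given below. The symbols $A$, $u$, $v$, and $X$ are generic placeholders, not the paper's notation, and how the authors combine the two identities inside their alternating optimization is not specified in the abstract.

```latex
% Sherman-Morrison: inverse of a rank-one update of an invertible matrix A,
% valid when 1 + v^{\top} A^{-1} u \neq 0.
\[
  (A + u v^{\top})^{-1}
  = A^{-1} - \frac{A^{-1} u v^{\top} A^{-1}}{1 + v^{\top} A^{-1} u}
\]

% Neumann series: series expansion of an inverse,
% valid when the spectral radius of X is below 1.
\[
  (I - X)^{-1} = \sum_{k=0}^{\infty} X^{k}
\]
```

Both identities avoid recomputing a full matrix inverse from scratch, which is presumably why they are attractive inside an iterative optimization loop.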

Authors:Meng Wang, Tian Lin, Qingshan Hou, Aidi Lin, Jingcheng Wang, Qingsheng Peng, Truong X. Nguyen, Danqi Fang, Ke Zou, Ting Xu, Cancan Xue, Ten Cheer Quek, Qinkai Yu, Minxin Liu, Hui Zhou, Zixuan Xiao, Guiqin He, Huiyu Liang, Tingkun Shi, Man Chen, Linna Liu, Yuanyuan Peng, Lianyu Wang, Qiuming Hu, Junhong Chen, Zhenhua Zhang, Cheng Chen, Yitian Zhao, Dianbo Liu, Jianhua Wu, Xinjian Chen, Changqing Zhang, Triet Thanh Nguyen, Yanda Meng, Yalin Zheng, Yih Chung Tham, Carol Y. Cheung, Huazhu Fu, Haoyu Chen, Ching-Yu Cheng
Title: A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers
Abstract:
Artificial intelligence (AI) shows remarkable potential in medical imaging diagnostics, yet most current models require retraining when applied across different clinical settings, limiting their scalability. We introduce GlobeReady, a clinician-friendly AI platform that enables fundus disease diagnosis without retraining, fine-tuning, or the need for technical expertise. GlobeReady demonstrates high accuracy across imaging modalities: 93.9-98.5% for 11 fundus diseases using color fundus photographs (CFPs) and 87.2-92.7% for 15 fundus diseases using optical coherence tomography (OCT) scans. By leveraging training-free local feature augmentation, the GlobeReady platform effectively mitigates domain shifts across centers and populations, achieving accuracies of 88.9-97.4% across five centers on average in China, 86.3-96.9% in Vietnam, 73.4-91.0% in Singapore, and 90.2-98.9% in the UK. Incorporating a built-in confidence-quantifiable diagnostic mechanism further enhances the platform's accuracy to 94.9-99.4% with CFPs and 88.2-96.2% with OCT, while enabling identification of out-of-distribution cases with 86.3% accuracy across 49 common and rare fundus diseases using CFPs, and 90.6% accuracy across 13 diseases using OCT. Clinicians from countries rated GlobeReady highly for usability and clinical relevance (average score 4.6/5). These findings demonstrate GlobeReady's robustness, generalizability and potential to support global ophthalmic care without technical barriers.
中文: GlobeReady是一种无需重新训练即可在不同临床环境中准确诊断眼底疾病的AI平台,通过免训练局部特征增强有效应对领域偏移,展现出高精度和临床实用性。
English: GlobeReady is a clinician-friendly AI platform that enables accurate fundus disease diagnosis across diverse clinical settings without requiring retraining, demonstrating high accuracy and usability while effectively mitigating domain shifts.

Authors:Chaoyue Niu, Yucheng Ding, Junhui Lu, Zhengxiang Huang, Hang Zeng, Yutong Dai, Xuezhen Tu, Chengfei Lv, Fan Wu, Guihai Chen
Title: Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
Abstract:
The conventional cloud-based large model learning framework is increasingly constrained by latency, cost, personalization, and privacy concerns. In this survey, we explore an emerging paradigm: collaborative learning between on-device small model and cloud-based large model, which promises low-latency, cost-efficient, and personalized intelligent services while preserving user privacy. We provide a comprehensive review across hardware, system, algorithm, and application layers. At each layer, we summarize key problems and recent advances from both academia and industry. In particular, we categorize collaboration algorithms into data-based, feature-based, and parameter-based frameworks. We also review publicly available datasets and evaluation metrics with user-level or device-level consideration tailored to collaborative learning settings. We further highlight real-world deployments, ranging from recommender systems and mobile livestreaming to personal intelligent assistants. We finally point out open research directions to guide future development in this rapidly evolving field.
Chinese: 本综述提出了一种设备端小模型与云端大模型协同学习的新范式,以解决延迟、成本和隐私问题,从硬件、系统、算法和应用层面综述了研究进展,并指出了实际部署案例和未来研究方向。
English: This survey introduces a collaborative learning paradigm between on-device small models and cloud-based large models to address latency, cost, and privacy issues, reviewing advances across hardware, systems, algorithms, and applications while highlighting real-world deployments and future research directions.

Authors:Yizhu Jiao, Xuchao Zhang, Zhaoyang Wang, Yubo Ma, Zhun Deng, Rujia Wang, Chetan Bansal, Saravan Rajmohan, Jiawei Han, Huaxiu Yao
Title: Synergistic Weak-Strong Collaboration by Aligning Preferences
Abstract:
Current Large Language Models (LLMs) excel in general reasoning yet struggle with specialized tasks requiring proprietary or domain-specific knowledge. Fine-tuning large models for every niche application is often infeasible due to black-box constraints and high computational overhead. To address this, we propose a collaborative framework that pairs a specialized weak model with a general strong model. The weak model, tailored to specific domains, produces initial drafts and background information, while the strong model leverages its advanced reasoning to refine these drafts, extending LLMs' capabilities to critical yet specialized tasks. To optimize this collaboration, we introduce collaborative feedback to fine-tune the weak model, which quantifies the influence of the weak model's contributions in the collaboration procedure and establishes preference pairs to guide preference tuning of the weak model. We validate our framework through experiments on three domains. We find that the collaboration significantly outperforms each model alone by leveraging complementary strengths. Moreover, aligning the weak model with the collaborative preference further enhances overall performance.
中文: 本研究提出一种协作框架,让专业弱模型与通用强模型协同工作,弱模型提供领域特定草稿,强模型进行优化,通过优势互补显著超越单一模型性能。
English: This study introduces a collaborative framework where a specialized weak model and a general strong model work together, with the weak model providing domain-specific drafts and the strong model refining them, significantly outperforming individual models by leveraging their complementary strengths.

Authors:Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz, Claudia Linnhoff-Popien, Thomas Gabor
Title: Surrogate Fitness Metrics for Interpretable Reinforcement Learning
Abstract:
We employ an evolutionary optimization framework that perturbs initial states to generate informative and diverse policy demonstrations. A joint surrogate fitness function guides the optimization by combining local diversity, behavioral certainty, and global population diversity. To assess demonstration quality, we apply a set of evaluation metrics, including the reward-based optimality gap, fidelity interquartile means (IQMs), fitness composition analysis, and trajectory visualizations. Hyperparameter sensitivity is also examined to better understand the dynamics of trajectory optimization. Our findings demonstrate that optimizing trajectory selection via surrogate fitness metrics significantly improves interpretability of RL policies in both discrete and continuous environments. In gridworld domains, evaluations reveal significantly enhanced demonstration fidelities compared to random and ablated baselines. In continuous control, the proposed framework offers valuable insights, particularly for early-stage policies, while fidelity-based optimization proves more effective for mature policies. By refining and systematically analyzing surrogate fitness functions, this study advances the interpretability of RL models. The proposed improvements provide deeper insights into RL decision-making, benefiting applications in safety-critical and explainability-focused domains.
Chinese: 本研究采用进化优化框架和代理适应度函数生成多样化的策略演示,通过改进轨迹选择显著增强了离散和连续环境中强化学习策略的可解释性。
English: This study uses an evolutionary optimization framework with a surrogate fitness function to generate diverse policy demonstrations, significantly enhancing RL policy interpretability in both discrete and continuous environments through improved trajectory selection.

Authors:Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu
Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Abstract:
Large language models (LLMs) are prone to capturing biases from training corpora, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework that requires neither customized prompt design nor model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
中文: FairSteer是一种新颖的推理时去偏框架,通过检测激活中的偏见特征并利用计算出的导向向量进行调整,无需定制提示或模型重训练,在多项任务中展现出优越性能。
English: FairSteer is a novel inference-time debiasing framework that detects bias signatures in activations and adjusts them using computed steering vectors, eliminating the need for prompt customization or model retraining while demonstrating superior performance across multiple tasks.
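The three steps named in this entry (detection, DSV computation, dynamic steering) can be sketched on toy vectors. Everything below is an assumption made for illustration. The DSV is taken as the difference of mean activations (unbiased minus biased), the detector is a fixed linear classifier with made-up weights, and the function names are hypothetical. The paper's actual classifier training and layer selection are not covered.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def compute_dsv(biased_acts, unbiased_acts):
    """Assumed DSV: mean unbiased activation minus mean biased activation."""
    mb, mu = mean_vector(biased_acts), mean_vector(unbiased_acts)
    return [u - b for u, b in zip(mu, mb)]

def is_biased(act, w, bias):
    """Linear detector: positive score means 'biased' (weights are assumed given)."""
    return sum(wi * ai for wi, ai in zip(w, act)) + bias > 0

def steer(act, dsv, w, bias, alpha=1.0):
    """Apply the DSV only when the detector flags the activation."""
    if is_biased(act, w, bias):
        return [a + alpha * d for a, d in zip(act, dsv)]
    return list(act)

# Toy activations standing in for contrastive prompt pairs.
dsv = compute_dsv([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
w, b = [1.0, -1.0], 0.0
print(steer([2.0, 0.0], dsv, w, b), steer([0.0, 2.0], dsv, w, b))
```

The conditional in `steer` is what makes the intervention dynamic: activations the detector considers unbiased pass through unchanged.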

Authors:Yi Sun, Han Wang, Jiaqiang Li, Jiacheng Liu, Xiangyu Li, Hao Wen, Yizhen Yuan, Huiwen Zheng, Yan Liang, Yuanchun Li, Yunxin Liu
Title: An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint
Abstract:
Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remains effective under strict constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between token budgets and actual on-device latency budgets. The results reveal several interesting findings regarding budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g., the optimal choices of either model size or prompt style change under different budgets. These findings offer a timely evaluation of this area and practical guidance for users to deploy LLMs under real-world latency constraints.
中文摘要:大型语言模型在严格输出长度限制下,其推理能力表现与无约束时不同,需根据实际延迟预算调整模型规模和提示方式以获得最佳效果。
English Summary: Large Language Models show improved accuracy with extended reasoning but face challenges under strict output length constraints, where optimal model choices differ from unconstrained scenarios.
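The mapping between a latency budget and a token budget can be approximated with a back-of-the-envelope sketch. The linear model below (fixed prefill time plus constant decode throughput) and the example numbers are assumptions for illustration. The paper's measured on-device mappings may differ.

```python
def max_output_tokens(latency_budget_s, decode_tokens_per_s, prefill_s=0.0):
    """Largest output length that fits the latency budget under a linear decode model."""
    usable = max(latency_budget_s - prefill_s, 0.0)
    return int(usable * decode_tokens_per_s)

# Hypothetical device: 20 tokens/s decoding, 0.5 s prompt prefill.
budget = max_output_tokens(latency_budget_s=10.0, decode_tokens_per_s=20.0, prefill_s=0.5)
print(budget)
```

Under this model a smaller, faster model gets a larger token budget for the same latency, which is one way the best model size can change with the constraint.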

Authors:Xiaotian Zhang, Ruizhe Chen, Yang Feng, Zuozhu Liu
Title: Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Abstract:
Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates whether the predicted tokens should be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment. Our code is available here.
中文摘要:Persona-judge提出了一种无需训练的区别性范式,通过利用模型内在偏好判断能力实现个性化对齐,无需外部奖励即可达成可扩展且计算高效的定制化方案。
English Summary: Persona-judge introduces a training-free discriminative approach for personalized alignment by utilizing models' intrinsic preference judgment, enabling scalable and computationally efficient customization without external rewards.
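The draft-then-judge loop can be sketched with toy stand-ins for both models. The acceptance rule used here (keep a token while the judge assigns it at least a threshold probability, stop at the first rejection) is an assumption. The abstract does not specify the criterion or what happens after a rejection, and `judged_decode` and `judge_prob` are hypothetical names.

```python
def judged_decode(draft_tokens, judge_prob, threshold=0.1):
    """Accept draft tokens in order until the judge rejects one.

    draft_tokens: tokens proposed by the draft model (a plain list here).
    judge_prob(prefix, token): judge's probability for `token` given accepted prefix.
    """
    accepted = []
    for tok in draft_tokens:
        if judge_prob(accepted, tok) < threshold:
            break  # assumed behavior: stop at the first rejected token
        accepted.append(tok)
    return accepted

# Toy judge that dislikes one specific token.
toy_judge = lambda prefix, tok: 0.0 if tok == "bad" else 0.9
result = judged_decode(["a", "b", "bad", "c"], toy_judge)
print(result)
```

Because both roles only need forward passes, a scheme of this shape requires no gradient updates, consistent with the training-free claim.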

Authors:Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng
Title: GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Abstract:
Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
中文: GeoSense作为首个全面双语基准,旨在系统评估多模态大语言模型的几何推理能力,实验发现尽管Gemini-2.0-pro-flash以65.3分领先,但几何原理的识别与应用仍是制约模型推理能力的关键瓶颈。
English: GeoSense is introduced as the first comprehensive bilingual benchmark to systematically evaluate multimodal large language models' geometric reasoning abilities, revealing that identifying and applying geometric principles remains a bottleneck despite Gemini-2.0-pro-flash achieving the highest score of 65.3.

Authors:Zheng Zhang, Ning Li, Qi Liu, Rui Li, Weibo Gao, Qingyang Mao, Zhenya Huang, Baosheng Yu, Dacheng Tao
Title: The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems from various perspectives. While these advancements have yielded significant results, the application of RAG in domains with considerable societal implications raises a critical question about fairness: What impact does the introduction of the RAG paradigm have on the fairness of LLMs? To address this question, we conduct extensive experiments by varying the LLMs, retrievers, and retrieval sources. Our experimental analysis reveals that the scale of the LLMs plays a significant role in influencing fairness outcomes within the RAG framework. When the model scale is smaller than 8B, the integration of retrieval mechanisms often exacerbates unfairness in small-scale LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness issues introduced by RAG for small-scale LLMs, we propose two approaches, FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the LLM in terms of fairness, enabling it to retrieve documents that facilitate fairer model outputs. In FairFilter, we propose a fairness filtering mechanism to filter out biased content after retrieval. Finally, we validate our proposed approaches on real-world datasets, demonstrating their effectiveness in improving fairness while maintaining performance.
中文: 检索增强生成(RAG)通过引入外部知识提升大语言模型性能,但会加剧小规模模型的不公平性,为此提出的FairFT和FairFilter方法在保持性能的同时有效改善了公平性。
English: Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge but can worsen fairness in smaller models, prompting the development of FairFT and FairFilter methods to mitigate bias while preserving performance.

Authors:Drishti Goel, Raghav Magazine, Supriyo Ghosh, Akshay Nambi, Prathamesh Deshpande, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
Title: eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization
Abstract:
Root cause analysis (RCA) for incidents in large-scale cloud systems is a complex, knowledge-intensive task that often requires significant manual effort from on-call engineers (OCEs). Improving RCA is vital for accelerating the incident resolution process and reducing service downtime and manual effort. Recent advancements in Large Language Models (LLMs) have proven to be effective in solving different stages of the incident management lifecycle, including RCA. However, existing LLM-based RCA recommendations typically leverage default finetuning or retrieval augmented generation (RAG) methods with static, manually designed prompts, which lead to sub-optimal recommendations. In this work, we leverage 'PromptWizard', a state-of-the-art prompt optimization technique, to automatically identify the best optimized prompt instruction that is combined with semantically similar historical examples for querying underlying LLMs during inference. Moreover, by utilizing more than 180K historical incident data from Microsoft, we developed cost-effective finetuned small language models (SLMs) for RCA recommendation generation and demonstrate the power of prompt optimization on such domain-adapted models. Our extensive experimental results show that prompt optimization can improve the accuracy of RCA recommendations by 21% and 13% on 3K test incidents over RAG-based LLMs and finetuned SLMs, respectively. Lastly, our human evaluation with incident owners has demonstrated the efficacy of prompt optimization on RCA recommendation tasks. These findings underscore the advantages of incorporating prompt optimization into AI for Operations (AIOps) systems, delivering substantial gains without increasing computational overhead.
中文: 提示优化显著提升了云系统根本原因分析的推荐准确性,相比传统方法最高可提高21%的准确率,且无需增加计算开销。
English: Prompt optimization significantly enhances root cause analysis recommendations in cloud systems, boosting accuracy by up to 21% over traditional methods without added computational costs.
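
The abstract mentions combining the optimized instruction with "semantically similar historical examples". A minimal sketch of such example selection is given below, assuming incidents have already been embedded as vectors and that similarity is measured by cosine similarity; the incident names, the three-dimensional embeddings, and the function names are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query: Sequence[float],
                  history: List[Tuple[str, Sequence[float]]],
                  k: int) -> List[str]:
    """Return the names of the k historical incidents most similar to the query."""
    ranked = sorted(history, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical incident embeddings (made up for illustration).
history = [
    ("disk-full", [1.0, 0.0, 0.0]),
    ("dns-timeout", [0.0, 1.0, 0.0]),
    ("disk-latency", [0.9, 0.1, 0.0]),
]
examples = top_k_similar([1.0, 0.05, 0.0], history, k=2)
print(examples)
```

The selected examples would then be appended to the prompt as few-shot demonstrations; how the paper formats them is not specified in the abstract.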

Authors:Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Guihai Chen, Jie Wu
Title: Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance
Abstract:
Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement for DRAGON: up to 1.9x greater gains over a standalone SLM compared to centralized RAG, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
中文:DRAGON框架通过分布式检索增强生成方法,整合通用与个人知识来提升设备端小型语言模型的性能,在保护文档隐私的同时显著提高效率并降低延迟。
English: The DRAGON framework enhances on-device small language models by integrating both general and personal knowledge through a distributed retrieval-augmented generation approach, significantly improving performance while preserving document privacy and reducing latency.

Authors:Tzu-Yun Tseng, Hongyu Lyu, Josephine Li, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: M2S-RoAD: Multi-Modal Semantic Segmentation for Road Damage Using Camera and LiDAR Data
Abstract:
Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.
中文: 本文提出了M2S-RoAD数据集,专注于农村地区九类道路损坏的语义分割,以弥补当前自动驾驶研究中乡村道路检测的不足。
English: This paper introduces M2S-RoAD, a dataset for semantic segmentation of road damage in rural areas, addressing the research gap in automated detection for improved autonomous vehicle safety.

Authors:Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, Bingsheng He
Title: Assessing Judging Bias in Large Reasoning Models: An Empirical Study
Abstract:
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel "superficial reflection bias" where phrases mimicking reasoning (e.g., "wait, let me think...") significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19% in preference alignment datasets and 14% in fact-related datasets, in-context learning that provides up to 27% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10% in preference datasets and 16% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.
中文: 研究表明,大型推理模型尽管具备先进推理能力,仍易受多种偏见影响,并通过设计专用提示和自我反思等缓解策略,显著提升了自动化评判框架的可靠性。
English: This study reveals that Large Reasoning Models (LRMs) remain susceptible to various biases despite their advanced reasoning capabilities, and proposes mitigation strategies like specialized prompts and self-reflection to enhance reliability in automated judging frameworks.
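
One simple way to quantify the position bias mentioned in finding (3) is to show each answer pair in both orders and count how often the judge picks the later slot both times. The sketch below assumes this swap-based protocol; the judge function, the answer pairs, and the metric name are hypothetical and are not the paper's benchmark.

```python
from typing import Callable, List, Tuple

def position_bias_rate(judge: Callable[[str, str], int],
                       pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs where the judge picks the second slot in both orders.

    `judge(a, b)` returns 0 if it prefers the first-shown answer, 1 for the second.
    Choosing slot 1 under both orderings means the verdict followed position,
    not content.
    """
    later_preferred = 0
    for a, b in pairs:
        if judge(a, b) == 1 and judge(b, a) == 1:
            later_preferred += 1
    return later_preferred / len(pairs)

# Hypothetical judge: prefers the longer answer, breaking ties toward the later slot.
def toy_judge(first: str, second: str) -> int:
    return 0 if len(first) > len(second) else 1

pairs = [("aa", "bb"), ("long answer", "short"), ("x", "y"), ("abc", "abcdef")]
rate = position_bias_rate(toy_judge, pairs)
print(rate)
```

A content-driven judge would score 0 on this metric, since swapping the order would swap its choice of slot.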

Authors:Amin Vahidi-Moghaddam, Kaian Chen, Kaixiang Zhang, Zhaojian Li, Yan Wang, Kai Wu
Title: Safe Data-Driven Predictive Control
Abstract:
In the realm of control systems, model predictive control (MPC) has exhibited remarkable potential; however, its reliance on accurate models and substantial computational resources has hindered its broader application, especially within real-time nonlinear systems. This study presents an innovative control framework to enhance the practical viability of MPC. The developed safe data-driven predictive control aims to eliminate the requirement for precise models and alleviate the computational burden of nonlinear MPC (NMPC). This is achieved by learning both the system dynamics and the control policy, enabling efficient data-driven predictive control while ensuring system safety. The methodology involves spatial temporal filter (STF)-based concurrent learning for system identification, a robust control barrier function (RCBF) to ensure system safety amid model uncertainties, and an RCBF-based NMPC policy approximation. An online policy correction mechanism is also introduced to counteract performance degradation caused by the existing model uncertainties. Demonstrated through simulations on two applications, the proposed approach offers comparable performance to existing benchmarks with significantly reduced computational costs.
中文摘要:本研究提出了一种安全的数据驱动预测控制框架,通过同时学习系统动力学和控制策略,结合鲁棒控制屏障函数和在线策略校正机制,在确保系统安全的同时消除了非线性模型预测控制对精确模型的依赖并显著降低了计算负担。
English Summary: This study introduces a safe data-driven predictive control framework that eliminates the need for precise models and reduces computational burdens in nonlinear MPC by learning system dynamics and control policies while ensuring safety through robust control barrier functions and online policy correction.
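
To give a feel for how a control barrier function keeps a system safe under bounded model error, here is a minimal sketch for a one-dimensional integrator. The model, the barrier h(x) = x_max - x, the disturbance bound `d_max`, and all parameter values are assumptions chosen for illustration; this is a textbook-style discrete-time filter, not the RCBF or NMPC policy developed in the paper.

```python
def cbf_filter(x: float, u_nom: float, x_max: float,
               dt: float = 0.1, gamma: float = 0.5, d_max: float = 0.0) -> float:
    """Clip a nominal input so that h(x) = x_max - x stays non-negative.

    Assumed model: x_next = x + dt * (u + d), with |d| <= d_max.
    Enforcing h(x_next) >= (1 - gamma) * h(x) for the worst-case d gives
    u <= gamma * h(x) / dt - d_max.
    """
    h = x_max - x
    u_bound = gamma * h / dt - d_max
    return min(u_nom, u_bound)

# Simulate with the worst-case disturbance always pushing toward the boundary.
x, x_max, dt, d_max = 0.0, 1.0, 0.1, 0.2
trajectory = [x]
for _ in range(30):
    u = cbf_filter(x, u_nom=2.0, x_max=x_max, dt=dt, d_max=d_max)
    x = x + dt * (u + d_max)
    trajectory.append(x)

print(max(trajectory) <= x_max)
```

The nominal input is left untouched while the state is far from the boundary and is reduced only as h(x) shrinks, so the state approaches x_max without crossing it.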

Authors:Chaojian Li, Zhifan Ye, Massimiliano Lupo Pasini, Jong Youl Choi, Cheng Wan, Yingyan Celine Lin, Prasanna Balaprakash
Title: Scaling Laws of Graph Neural Networks for Atomistic Materials Modeling
Abstract:
Atomistic materials modeling is a critical task with wide-ranging applications, from drug discovery to materials science, where accurate predictions of the target material property can lead to significant advancements in scientific discovery. Graph Neural Networks (GNNs) represent the state-of-the-art approach for modeling atomistic material data thanks to their capacity to capture complex relational structures. While machine learning performance has historically improved with larger models and datasets, GNNs for atomistic materials modeling remain relatively small compared to large language models (LLMs), which leverage billions of parameters and terabyte-scale datasets to achieve remarkable performance in their respective domains. To address this gap, we explore the scaling limits of GNNs for atomistic materials modeling by developing a foundational model with billions of parameters, trained on extensive terabyte-scale datasets. Our approach incorporates techniques from LLM libraries to efficiently manage large-scale data and models, enabling both effective training and deployment of these large-scale GNN models. This work addresses three fundamental questions in scaling GNNs: the potential for scaling GNN model architectures, the effect of dataset size on model accuracy, and the applicability of LLM-inspired techniques to GNN architectures. Specifically, the outcomes of this study include (1) insights into the scaling laws for GNNs, highlighting the relationship between model size, dataset volume, and accuracy, (2) a foundational GNN model optimized for atomistic materials modeling, and (3) a GNN codebase enhanced with advanced LLM-based training techniques. Our findings lay the groundwork for large-scale GNNs with billions of parameters and terabyte-scale datasets, establishing a scalable pathway for future advancements in atomistic materials modeling.
中文摘要:本研究通过开发数十亿参数模型并在太字节级数据集上训练,探索了图神经网络在原子尺度材料建模中的扩展极限,建立了扩展规律并证明了大语言模型技术可有效提升图神经网络的性能。
English Summary: This study explores scaling graph neural networks (GNNs) for atomistic materials modeling by developing billion-parameter models trained on terabyte-scale datasets, establishing scaling laws and demonstrating how large language model techniques can enhance GNN performance.
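
Scaling laws of the kind discussed here are commonly summarized by a power law relating loss to model or dataset size. The sketch below fits loss = a * size^(-b) by least squares in log-log space. The functional form is an assumption for illustration (the abstract does not state the fitted form), and the data points are synthetic, not results from the paper.

```python
import math
from typing import List, Tuple

def fit_power_law(sizes: List[float], losses: List[float]) -> Tuple[float, float]:
    """Fit loss = a * size**(-b) by ordinary least squares on log-log values."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic, noise-free points generated from a = 5.0, b = 0.3 (not paper data).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.3 for s in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))
```

On real measurements the fit would be noisy, and the exponent b indicates how quickly accuracy improves as the model or dataset grows.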

Authors:Hongyu Lyu, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: SydneyScapes: Image Segmentation for Australian Environments
Abstract:
Autonomous Vehicles (AVs) are being partially deployed and tested across various global locations, including China, the USA, Germany, France, Japan, Korea, and the UK, but with limited demonstrations in Australia. The integration of machine learning (ML) into AV perception systems highlights the need for locally labelled datasets to develop and test algorithms in specific environments. To address this, we introduce SydneyScapes - a dataset tailored for computer vision tasks of image semantic, instance, and panoptic segmentation. This dataset, collected from Sydney and surrounding cities in New South Wales (NSW), Australia, consists of 756 images with high-quality pixel-level annotations. It is designed to assist AV industry and researchers by providing annotated data and tools for algorithm development, testing, and deployment in the Australian context. Additionally, we offer benchmarking results using state-of-the-art algorithms to establish reference points for future research and development. The dataset is publicly available at https://hdl.handle.net/2123/33051.
中文: SydneyScapes数据集包含756张来自悉尼及周边地区的高质量标注图像,旨在满足澳大利亚本土自动驾驶感知算法开发与测试的需求,并提供基准结果以支持未来研究。
English: The SydneyScapes dataset, comprising 756 high-quality annotated images from Sydney and surrounding areas, addresses the need for localized data to develop and test autonomous vehicle perception algorithms in Australia, with benchmarking results provided for future research.

Authors:Amin Vahidi-Moghaddam, Keyi Zhu, Kaixiang Zhang, Ziyou Song, Zhaojian Li
Title: Data-Enabled Neighboring Extremal: Case Study on Model-Free Trajectory Tracking for Robotic Arm
Abstract:
Data-enabled predictive control (DeePC) has recently emerged as a powerful data-driven approach for efficient system control with constraint-handling capabilities. It performs optimal control by directly harnessing input-output (I/O) data, bypassing the process of explicit model identification that can be costly and time-consuming. However, its high computational complexity, driven by a large-scale optimization problem (typically in a higher dimension than its model-based counterpart, Model Predictive Control), hinders real-time applications. To overcome this limitation, we propose the data-enabled neighboring extremal (DeeNE) framework, which significantly reduces computational cost while preserving control performance. DeeNE leverages first-order optimality perturbation analysis to efficiently update a precomputed nominal DeePC solution in response to changes in initial conditions and reference trajectories. We validate its effectiveness on a 7-DoF KINOVA Gen3 robotic arm, demonstrating substantial computational savings and robust, data-driven control performance.
中文: 提出的数据驱动邻近极值(DeeNE)框架通过高效更新预计算解,克服了数据驱动预测控制的计算瓶颈,在保持鲁棒控制性能的同时显著降低计算成本,并在机械臂上验证了其有效性。
English: The proposed data-enabled neighboring extremal (DeeNE) framework overcomes the computational limitations of data-enabled predictive control by efficiently updating precomputed solutions, achieving significant computational savings while maintaining robust control performance as validated on a robotic arm.
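
As background on how DeePC "directly harnesses" I/O data: the standard formulation stacks recorded samples into Hankel matrices whose columns are sliding windows of the data, and the size of these matrices is what drives the large optimization problem. A minimal sketch for a scalar signal follows; the function name and the sample values are illustrative, and nothing here reproduces the DeeNE update itself.

```python
from typing import List

def hankel(signal: List[float], depth: int) -> List[List[float]]:
    """Return the depth x (T - depth + 1) Hankel matrix with H[i][j] = signal[i + j]."""
    cols = len(signal) - depth + 1
    if cols < 1:
        raise ValueError("depth exceeds signal length")
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]

# Made-up input samples; each column of H_u is a length-3 window of the recording.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
H_u = hankel(u, depth=3)
for row in H_u:
    print(row)
```

The number of columns grows with the length of the recording, which is why the decision variable in DeePC is typically higher-dimensional than in model-based MPC.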

Authors:Shahryar Zehtabi, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
Title: Decentralized Domain Generalization with Style Sharing: Formal Model and Convergence Analysis
Abstract:
Much of federated learning (FL) focuses on settings where local dataset statistics remain the same between training and testing. However, this assumption often does not hold in practice due to distribution shifts, motivating the development of domain generalization (DG) approaches that leverage source domain data to train models capable of generalizing to unseen target domains. In this paper, we are motivated by two major gaps in existing work on FL and DG: (1) the lack of formal mathematical analysis of DG objectives; and (2) DG research in FL being limited to the star-topology architecture. We develop Decentralized Federated Domain Generalization with Style Sharing ($\textit{StyleDDG}$), a decentralized DG algorithm which allows devices in a peer-to-peer network to achieve DG based on sharing style information inferred from their datasets. Additionally, we provide the first systematic approach to analyzing style-based DG training in decentralized networks. We cast existing centralized DG algorithms within our framework, and employ their formalisms to model $\textit{StyleDDG}$. We then obtain analytical conditions under which convergence of $\textit{StyleDDG}$ can be guaranteed. Through experiments on popular DG datasets, we demonstrate that $\textit{StyleDDG}$ can obtain significant improvements in accuracy across target domains with minimal communication overhead compared to baseline decentralized gradient methods.
中文摘要:本文提出StyleDDG算法,通过设备间共享数据集风格信息实现去中心化联邦学习的领域泛化,并首次对该类方法提供了收敛性保证的形式化分析。
English Summary: This paper introduces StyleDDG, a decentralized federated learning algorithm that enables devices in peer-to-peer networks to achieve domain generalization by sharing style information from their datasets, while providing the first formal analysis of convergence guarantees for such approaches.

Authors:Aniket Deroy, Subhankar Maity
Title: STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation
Abstract:
Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality automatically by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best one generated by the LLM. The process is then refined through iterative review and response with another LLM until the evaluation metric values converge. By automating this evaluation, the method improves the estimation of question quality. Correlation scores show that the proposed method improves correlation with human judgments compared to the baseline method. Error analysis shows that metrics such as relevance and appropriateness improve significantly relative to human judgments when using STRIVE.
中文摘要:STRIVE是一种新颖的自动化方法,通过多轮大型语言模型的迭代优化来评估问题质量,显著提升了评估准确性及与人工判断的相关性。
English Summary: STRIVE is a novel automated method using multiple Large Language Models to evaluate question quality through iterative refinement, improving accuracy and correlation with human judgments.

Authors:Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Title: Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?
Abstract:
This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that the current state-of-the-art pre-trained LLMs are not yet fully suited to automatic deployment in HR interview assessment. Instead, our findings advocate for a human-in-the-loop approach that incorporates manual checks for inconsistencies and provisions for improving feedback quality as a more suitable strategy.
中文: 本研究评估了预训练大语言模型在人力资源面试评估中与人类专家的表现,发现尽管GPT-4 Turbo等模型在评分方面可与人类媲美,但在错误识别和具体改进建议上存在不足,因此需要采用人机协同策略来确保评估可靠性。
English: This study evaluates leading pre-trained large language models against human experts in HR interview assessments, finding that while models like GPT-4 Turbo can match human scoring, they struggle with error identification and actionable feedback, necessitating a human-in-the-loop approach for reliable deployment.

Authors:Subhankar Maity, Aniket Deroy
Title: Leveraging Prompt-Tuning for Bengali Grammatical Error Explanation Using Large Language Models
Abstract:
We propose a novel three-step prompt-tuning method for Bengali Grammatical Error Explanation (BGEE) using state-of-the-art large language models (LLMs) such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b. Our approach involves identifying and categorizing grammatical errors in Bengali sentences, generating corrected versions of the sentences, and providing natural language explanations for each identified error. We evaluate the performance of our BGEE system using both automated evaluation metrics and human evaluation conducted by experienced Bengali language experts. Our proposed prompt-tuning approach shows that GPT-4, the best performing LLM, surpasses the baseline model in automated evaluation metrics, with a 5.26% improvement in F1 score and a 6.95% improvement in exact match. Furthermore, compared to the previous baseline, GPT-4 demonstrates a decrease of 25.51% in wrong error type and a decrease of 26.27% in wrong error explanation. However, the results still lag behind the human baseline.
中文: 本研究提出了一种针对孟加拉语语法错误解释的三步提示调优方法,利用GPT-4等先进大语言模型进行错误识别、修正和解释,在自动评估指标上取得显著提升,但仍未达到人类基准水平。
English: This study introduces a three-step prompt-tuning method for Bengali Grammatical Error Explanation, utilizing advanced LLMs like GPT-4 to identify, correct, and explain errors, achieving notable improvements in automated metrics but still falling short of human performance.

Authors:Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See
Title: The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Abstract:
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, though disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This duality explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
中文: 思维链提示在基于模式的上下文学习中表现不如直接回答,原因在于其显性推理难以推断模式,而隐性推理虽能部分弥补却受制于有缺陷的推理过程。
English: Chain-of-Thought prompting underperforms direct answering in pattern-based in-context learning due to a duality where explicit reasoning struggles with pattern inference while implicit reasoning partially compensates despite flawed rationales.

Authors:Yifan Xie, Julian Berberich, Robin Strässer, Frank Allgöwer
Title: Bilinear Data-Driven Min-Max MPC: Designing Rational Controllers via Sum-of-squares Optimization
Abstract:
We propose a data-driven min-max model predictive control (MPC) scheme to control unknown discrete-time bilinear systems. Based on a sequence of noisy input-state data, we state a set-membership representation for the unknown system dynamics. Then, we derive a sum-of-squares (SOS) program that minimizes an upper bound on the worst-case cost over all bilinear systems consistent with the data. As a crucial technical ingredient, the SOS program involves a rational controller parameterization to improve feasibility and tractability. We prove that the resulting data-driven MPC scheme ensures closed-loop stability and constraint satisfaction for the unknown bilinear system. We demonstrate the practicality of the proposed scheme in a numerical example.
中文: 本文提出了一种数据驱动的极小极大模型预测控制方案,通过平方和规划与有理控制器参数化,确保未知双线性系统的闭环稳定性和约束满足。
English: This paper introduces a data-driven min-max model predictive control scheme that ensures stability and constraint satisfaction for unknown bilinear systems through sum-of-squares programming and rational controller parameterization.

Authors:Eleonora Grassucci, Gualtiero Grassucci, Aurelio Uncini, Danilo Comminiello
Title: Beyond Answers: How LLMs Can Pursue Strategic Thinking in Education
Abstract:
Artificial Intelligence (AI) holds transformative potential in education, enabling personalized learning, enhancing inclusivity, and encouraging creativity and curiosity. In this paper, we explore how Large Language Models (LLMs) can act as both patient tutors and collaborative partners to enhance education delivery. As tutors, LLMs personalize learning by offering step-by-step explanations and addressing individual needs, making education more inclusive for students with diverse backgrounds or abilities. As collaborators, they expand students' horizons, supporting them in tackling complex, real-world problems and co-creating innovative projects. However, to fully realize these benefits, LLMs must be leveraged not as tools for providing direct solutions but rather to guide students in developing problem-solving strategies and finding learning paths together. Therefore, a strong emphasis should be placed on educating students and teachers on the successful use of LLMs to ensure their effective integration into classrooms. Through practical examples and real-world case studies, this paper illustrates how LLMs can make education more inclusive and engaging while empowering students to reach their full potential.
中文: 大语言模型(LLMs)通过充当个性化导师和协作伙伴来提升教育质量,促进包容性与参与度,同时需要对学生和教师进行正确引导以充分发挥其优势。
English: Large Language Models (LLMs) can enhance education by serving as personalized tutors and collaborative partners, fostering inclusivity and engagement while requiring proper guidance for students and teachers to maximize their benefits.

Authors:Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang
Title: M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Abstract:
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose \textbf{M$^2$IV}, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce \textbf{VLibrary}, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74\% with substantial improvements in overall efficiency.
中文: M²IV方法通过引入可学习的多模态上下文向量替代密集的标记演示,实现了细粒度语义提炼和跨模态学习,从而显著提升大型视觉语言模型的性能与效率。
English: The M²IV method introduces learnable multimodal in-context vectors to replace token-intensive demonstrations, enhancing performance and efficiency in large vision-language models by enabling fine-grained semantic distillation and cross-modal learning.

Authors:Rui Xu, Xing Fan, Shengcai Liu, Wenjie Chen, Ke Tang
Title: Memetic Search for Green Vehicle Routing Problem with Private Capacitated Refueling Stations
Abstract:
The green vehicle routing problem with private capacitated alternative fuel stations (GVRP-PCAFS) extends the traditional green vehicle routing problem by considering the limited capacity of refueling stations, where only a limited number of vehicles can refuel simultaneously and additional vehicles must wait. This feature presents new challenges for route planning, as waiting times at stations must be managed while keeping route durations within limits and reducing total travel distance. This article presents METS, a novel memetic algorithm (MA) with separate constraint-based tour segmentation (SCTS) and efficient local search (ELS) for solving GVRP-PCAFS. METS combines global and local search effectively through three novelties. For global search, the SCTS strategy splits giant tours to generate diverse solutions, and the search process is guided by a comprehensive fitness evaluation function that dynamically controls feasibility and diversity to produce solutions that are both diverse and near-feasible. For local search, ELS incorporates tailored move operators with constant-time move evaluation mechanisms, enabling efficient exploration of large solution neighborhoods. Experimental results demonstrate that METS discovers 31 new best-known solutions out of 40 instances in existing benchmark sets, achieving substantial improvements over current state-of-the-art methods. Additionally, a new large-scale benchmark set based on real-world logistics data is introduced to facilitate future research.
中文: 本文提出METS算法,采用创新策略解决考虑加油站容量限制的绿色车辆路径问题,实验证明其性能卓越,获得了31个最优解并引入了基于真实物流数据的新基准集。
English: This paper introduces METS, a memetic algorithm with novel strategies for solving the green vehicle routing problem that accounts for limited-capacity refueling stations, demonstrating superior performance by achieving 31 new best-known solutions and introducing a new real-world benchmark set.

Authors:Alejandro Fontan, Tobias Fischer, Javier Civera, Michael Milford
Title: VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets
Abstract:
Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation--all accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility and extendability while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions. We demonstrate the ease with which user-relevant benchmarks can be created: here, we introduce difficulty-level-based categories, but one could envision environment-specific or condition-specific categories.
Chinese: VSLAM-LAB是一个统一框架,通过将编译、数据集处理和标准化实验集成到单一命令行界面,简化了视觉SLAM系统的开发、评估与部署,提高了兼容性和可复现性。
English: VSLAM-LAB is a unified framework that simplifies the development, evaluation, and deployment of Visual SLAM systems by integrating compilation, dataset handling, and standardized experiments into a single command-line interface, enhancing compatibility and reproducibility.

Authors:Hang Zhao, Juzhan Xu, Kexiong Yu, Ruizhen Hu, Chenyang Zhu, Bo Du, Kai Xu
Title: Deliberate Planning of 3D Bin Packing on Packing Configuration Trees
Abstract:
The online 3D Bin Packing Problem (3D-BPP) has widespread applications in industrial automation. Existing methods usually solve the problem with limited resolution of spatial discretization, and/or cannot deal with complex practical constraints well. We propose to enhance the practical applicability of online 3D-BPP via learning on a novel hierarchical representation, the packing configuration tree (PCT). PCT is a full-fledged description of the state and action space of bin packing which can support packing policy learning based on deep reinforcement learning (DRL). The size of the packing action space is proportional to the number of leaf nodes, making the DRL model easy to train and well-performing even with a continuous solution space. We further discover the potential of PCT as a tree-based planner for deliberately solving packing problems of industrial significance, including large-scale packing and different variations of the BPP setting. A recursive packing method is proposed to decompose large-scale packing into smaller sub-trees, while a spatial ensemble mechanism integrates local solutions into a global one. For different BPP variations with additional decision variables, such as lookahead, buffering, and offline packing, we propose a unified planning framework enabling out-of-the-box problem solving. Extensive evaluations demonstrate that our method outperforms existing online BPP baselines and is versatile in incorporating various practical constraints. The planning process excels across large-scale problems and diverse problem variations. We develop a real-world packing robot for industrial warehousing, with careful designs accounting for constrained placement and transportation stability. Our packing robot operates reliably and efficiently on unprotected pallets at 10 seconds per box. On average, it packs 19 boxes per pallet with 57.4% space utilization for relatively large-size boxes.
中文摘要:本研究提出了一种称为包装配置树(PCT)的分层表示方法,通过深度强化学习提升了在线三维装箱问题的实际应用性,在现实场景中有效处理复杂约束并展现出卓越性能。
English Summary: The study introduces a hierarchical representation called the packing configuration tree (PCT) to enhance the practical applicability of online 3D bin packing, demonstrating superior performance through deep reinforcement learning and effective handling of complex constraints in real-world applications.

Authors:Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Yuwei Guo, Dahua Lin, Tianfan Xue, Bo Dai
Title: Multi-identity Human Image Animation with Structural Video Diffusion
Abstract:
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.
Chinese Summary: 结构视频扩散是一种创新框架,通过采用身份特定嵌入和结合深度线索的结构学习机制,能够生成具有动态交互的真实多人视频,有效解决了现有方法在复杂人机交互场景中的局限性。
English Summary: Structural Video Diffusion is a novel framework that generates realistic multi-human videos by using identity-specific embeddings and structural learning with depth cues, addressing the limitations of existing methods in handling complex human-object interactions.

Authors:Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Guan Huang, Lihong Liu, Xingang Wang
Title: HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
Abstract:
Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce \textbf{HumanDreamer-X}, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting (3DGS) serves as an explicit 3D representation to provide initial geometry and appearance priors. Building upon this foundation, \textbf{HumanFixer} is trained to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric details and identity consistency across views. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.
中文: HumanDreamer-X提出了一种将多视角人体生成与三维重建相统一的框架,利用3D高斯溅射技术和注意力调制策略,显著提升了单图像人体重建的几何一致性与视觉保真度。
English: HumanDreamer-X introduces a unified framework that integrates multi-view human generation and 3D reconstruction, leveraging 3D Gaussian Splatting and an attention modulation strategy to significantly enhance geometric consistency and visual fidelity in single-image human reconstruction.
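
For readers unfamiliar with the PSNR figures quoted above, the following is a minimal sketch of the standard peak signal-to-noise ratio definition, PSNR = 10 log10(MAX^2 / MSE). The function name `psnr`, the flat-list image representation, and the default peak value of 255 are assumptions made for illustration; the abstract does not state how the authors compute the metric.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Hypothetical helper for illustration; `max_value` is the largest possible
    pixel value (255 for 8-bit images).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 10 grey levels: MSE = 100.
example_db = psnr([100, 100], [110, 90])
```

Because PSNR is logarithmic, a relative gain such as the 16.45% quoted in the abstract refers to the dB value itself, not to the underlying error.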

Authors:Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu
Title: Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation
Abstract:
With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs' ability to reason and explain with visual grounding in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.
中文: 本文针对大型语言模型在教育解释中缺乏视觉辅助的问题,提出了多模态解题说明任务并创建ME2基准,发现多数模型难以结合视觉要素进行推理,突显了开发具备多模态解释能力的AI导师的必要性。
English: This paper introduces a multimodal solution explanation task to address the lack of visual aids in LLM-generated educational explanations and presents the ME2 benchmark, revealing that most models struggle with visually grounded reasoning despite its importance for effective tutoring.

Authors:Xinghong Fu, Ziming Liu, Max Tegmark
Title: Do Two AI Scientists Agree?
Abstract:
When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show the same story is true for AI scientists. As more systems are provided in the training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics, aggregating training results across many seeds simulating the different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence of the training dynamics and final learned weights, controlling the rise and fall of relevant theories. We finally demonstrate that not only can our neural networks aid interpretability, they can also be applied to higher-dimensional problems.
中文: 当训练数据增多时,在同一科学任务上训练的AI模型会趋于学习相似的理论,尽管可能形成不同理论群组;我们提出的MASS框架揭示了随着系统复杂度增加,AI科学家会从哈密顿理论转向拉格朗日表述,同时训练动态表现出强烈的随机种子依赖性。
English: AI models trained on the same scientific task converge on similar theories as more data is provided, though they may form distinct groups, and our proposed MASS framework reveals a shift from Hamiltonian to Lagrangian formulations with increasing complexity while highlighting seed-dependent training dynamics.

Authors:Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Weijie Wang, Haoyun Li, Guosheng Zhao, Jie Li, Wenkang Qin, Guan Huang, Wenjun Mei
Title: WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
Abstract:
Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
中文摘要:WonderTurbo首次实现了实时交互式3D场景生成框架,通过StepSplat几何构建和QuickDepth深度补全加速几何建模,配合FastPaint即时绘制技术,能在0.72秒内生成新视角3D场景,速度提升15倍的同时保持优异的空间一致性和输出质量。
English Summary: WonderTurbo introduces the first real-time interactive 3D generation framework that creates novel 3D perspectives in 0.72 seconds through accelerated geometric modeling with StepSplat and QuickDepth, and instant appearance modeling with FastPaint, achieving a 15X speedup while maintaining high-quality output.

Authors:Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Zhaoxiang Zhang
Title: Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Abstract:
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
中文: 本研究提出Ross3D方法,通过结合跨视角和全局视角重建的3D感知视觉监督训练,在多模态模型中增强了对三维场景的理解能力,不仅取得了最先进的性能,还展现出利用未标注三维视觉数据的巨大潜力。
English: This work introduces Ross3D, a method that enhances 3D scene understanding in multimodal models by integrating reconstructive visual instruction tuning with cross-view and global-view reconstruction, achieving state-of-the-art results and showing potential for leveraging unlabeled 3D data.

Authors:Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille
Title: ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction
Abstract:
In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
Chinese: ReVision框架通过将参数化的3D物理知识融入预训练视频生成模型,显著提升了复杂运动和交互的生成质量,仅用15亿参数就超越了更大规模的先进模型。
English: The ReVision framework enhances video generation by integrating 3D physical knowledge into a pretrained model, significantly improving motion fidelity and coherence even with fewer parameters.

Authors:Nanxu Gong, Xinyuan Wang, Wangyang Ying, Haoyue Bai, Sixun Dong, Haifeng Chen, Yanjie Fu
Title: Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming
Abstract:
Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. This makes it necessary to transform feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficiently navigating a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) a critic agent diagnoses data to generate actionable advice, (2) a generator agent produces tokenized feature transformations guided by the critic's advice, and (3) iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.
中文摘要:本文提出了一种新颖的无监督特征转换框架,通过LLM智能体采用生成器-评判器二重奏机制来高效提升数据效用,在多个数据集上的表现优于有监督基线方法。
English Summary: This paper introduces a novel unsupervised feature transformation framework using LLM agents in a generator-critic duet to efficiently enhance data utility, which outperforms supervised baselines across multiple datasets.
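
The three steps named in the abstract (critic diagnosis, generator transformation, iterative refinement) can be sketched as a small loop. This is only an illustration of the control flow under strong assumptions: the paper uses LLM agents, whereas `critic`, `generator`, and `refine` below are hypothetical deterministic stand-ins, and the skew heuristic and `op:column` token format are invented for this example.

```python
import math
import statistics


def critic(data):
    """Stub critic (stand-in for an LLM agent): advise a log transform for
    columns whose mean is far above the median, a crude right-skew signal."""
    advice = []
    for name, values in data.items():
        if statistics.mean(values) > 1.5 * statistics.median(values):
            advice.append(f"log:{name}")
    return advice


def generator(data, advice):
    """Stub generator: turn each 'op:column' advice token into a new feature."""
    new_data = dict(data)
    for token in advice:
        op, name = token.split(":", 1)
        if op == "log":
            new_data[f"log({name})"] = [math.log1p(v) for v in data[name]]
    return new_data


def refine(data, max_rounds=3):
    """Iterate critic -> generator until the critic has no new advice."""
    for _ in range(max_rounds):
        advice = [
            token for token in critic(data)
            if f"log({token.split(':', 1)[1]})" not in data  # skip advice already applied
        ]
        if not advice:
            break
        data = generator(data, advice)
    return data


# Toy unlabeled table: 'income' is right-skewed, 'age' is not.
table = {"income": [1, 2, 3, 4, 100], "age": [20, 30, 40, 50, 60]}
result = refine(table)
```

Replacing `critic` with a function that asks a person for advice gives the human-agent variant the abstract mentions, since the loop only depends on the advice tokens.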

Authors:Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
Title: UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities
Abstract:
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.
中文摘要:UniversalRAG提出了一种新颖框架,通过动态路由查询至合适的模态特定语料库并整合多粒度知识,突破了单模态检索的局限,在多种基准测试中展现出卓越性能。
English Summary: UniversalRAG introduces a novel framework that overcomes the limitations of single-modality retrieval by dynamically routing queries to appropriate modality-specific corpora and incorporating multi-granular knowledge, demonstrating superior performance across diverse benchmarks.
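
The idea of routing a query to one modality-specific corpus at a chosen granularity, then retrieving only within it, can be sketched as follows. Everything here is an assumption for illustration: the keyword rules in `route_query` stand in for whatever router the authors actually use, the `CORPORA` contents are toy data, and word overlap replaces a real retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    modality: str     # e.g. "text", "image", "video"
    granularity: str  # e.g. "paragraph", "document", "clip", "full"


# Hypothetical corpora, one per (modality, granularity) pair.
CORPORA = {
    Route("text", "paragraph"): [
        "Paris is the capital of France.",
        "Water boils at 100 C at sea level.",
    ],
    Route("text", "document"): [
        "A long report on the history of France and its capital.",
    ],
    Route("image", "full"): [
        "photo: Eiffel Tower at night",
        "diagram: water cycle",
    ],
    Route("video", "clip"): ["clip: how to tie a bowline knot"],
}


def route_query(query: str) -> Route:
    """Toy keyword router standing in for a learned or prompted router."""
    q = query.lower()
    if any(w in q for w in ("look like", "picture", "photo")):
        return Route("image", "full")
    if any(w in q for w in ("how to", "demonstrate", "video")):
        return Route("video", "clip")
    if any(w in q for w in ("history", "overview", "summarize")):
        return Route("text", "document")
    return Route("text", "paragraph")


def retrieve(query: str, k: int = 1):
    """Route first, then rank items of the selected corpus by word overlap."""
    route = route_query(query)
    q_words = set(query.lower().split())
    ranked = sorted(
        CORPORA[route],
        key=lambda item: len(q_words & set(item.lower().split())),
        reverse=True,
    )
    return route, ranked[:k]


route, hits = retrieve("What does the Eiffel Tower look like?")
```

Because retrieval never compares items across modalities, the sketch sidesteps the modality gap the abstract describes for a single unified embedding space.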

Authors:Zihan Niu, Zheyong Xie, Shaosheng Cao, Chonggang Lu, Zheyu Ye, Tong Xu, Zuozhu Liu, Yan Gao, Jia Chen, Zhe Xu, Yi Wu, Yao Hu
Title: PaRT: Enhancing Proactive Social Chatbots with Personalized Real-Time Retrieval
Abstract:
Social chatbots have become essential intelligent companions in daily scenarios ranging from emotional support to personal interaction. However, conventional chatbots with passive response mechanisms usually rely on users to initiate or sustain dialogues by bringing up new topics, resulting in diminished engagement and shortened dialogue duration. In this paper, we present PaRT, a novel framework enabling context-aware proactive dialogues for social chatbots through personalized real-time retrieval and generation. Specifically, PaRT first integrates user profiles and dialogue context into a large language model (LLM), which is initially prompted to refine user queries and recognize their underlying intents for the upcoming conversation. Guided by refined intents, the LLM generates personalized dialogue topics, which then serve as targeted queries to retrieve relevant passages from RedNote. Finally, we prompt LLMs with summarized passages to generate knowledge-grounded and engagement-optimized responses. Our approach has been running stably in a real-world production environment for more than 30 days, achieving a 21.77\% improvement in the average duration of dialogues.
中文摘要:PaRT框架通过个性化实时检索与生成技术,使社交聊天机器人能够进行情境感知的主动对话,在实际应用中使对话平均时长提升21.77%。
English Summary: The PaRT framework enhances social chatbots by enabling context-aware proactive dialogues through personalized real-time retrieval and generation, significantly increasing dialogue duration by 21.77% in real-world deployment.

Authors:Jiang Wu, Rui Li, Yu Zhu, Rong Guo, Jinqiu Sun, Yanning Zhang
Title: Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views
Abstract:
We present a Gaussian Splatting method for surface reconstruction using sparse input views. Previous methods relying on dense views struggle with extremely sparse Structure-from-Motion points for initialization. While learning-based Multi-view Stereo (MVS) provides dense 3D points, directly combining it with Gaussian Splatting leads to suboptimal results due to the ill-posed nature of sparse-view geometric optimization. We propose Sparse2DGS, an MVS-initialized Gaussian Splatting pipeline for complete and accurate reconstruction. Our key insight is to incorporate geometric-prioritized enhancement schemes, allowing for direct and robust geometric learning under ill-posed conditions. Sparse2DGS outperforms existing methods by notable margins while being $2\times$ faster than the NeRF-based fine-tuning approach.
Chinese: 我们提出Sparse2DGS方法,通过多视角立体初始化结合几何优先增强机制,在稀疏视角下实现完整精确的表面重建,在重建质量和速度上均显著优于现有方法。
English: We introduce Sparse2DGS, a Gaussian Splatting method that leverages Multi-view Stereo initialization and geometric-prioritized enhancement to achieve complete and accurate surface reconstruction from sparse views, outperforming existing approaches in both quality and speed.

Authors:Anush Lakshman Sivaraman, Kojo Adu-Gyamfi, Ibne Farabi Shihab, Anuj Sharma
Title: ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery
Abstract:
Adverse weather conditions challenge safe transportation, necessitating robust real-time weather detection from traffic camera imagery. We propose a novel framework combining CycleGAN-based domain adaptation with efficient contrastive learning to enhance weather classification, particularly in low-light nighttime conditions. Our approach leverages the lightweight SigLIP-2 model, which employs pairwise sigmoid loss to reduce computational demands, integrated with CycleGAN to transform nighttime images into day-like representations while preserving weather cues. Evaluated on an Iowa Department of Transportation dataset, the baseline EVA-02 model with CLIP achieves a per-class overall accuracy of 96.55\% across three weather conditions (No Precipitation, Rain, Snow) and a day/night overall accuracy of 96.55\%, but shows a significant day-night gap (97.21\% day vs.\ 63.40\% night). With CycleGAN, EVA-02 improves to 97.01\% per-class accuracy and 96.85\% day/night accuracy, boosting nighttime performance to 82.45\%. Our Vision-SigLIP-2 + Text-SigLIP-2 + CycleGAN + Contrastive configuration excels in nighttime scenarios, achieving the highest nighttime accuracy of 85.90\%, with 94.00\% per-class accuracy and 93.35\% day/night accuracy. This model reduces training time by 89\% (from 6 hours to 40 minutes) and inference time by 80\% (from 15 seconds to 3 seconds) compared to EVA-02. By narrowing the day-night performance gap from 33.81 to 8.90 percentage points, our framework provides a scalable, efficient solution for all-weather classification using existing camera infrastructure.
中文: 本研究提出了一种结合CycleGAN域适应与高效对比学习的新框架,显著提升了夜间天气分类的准确性,同时将训练时间减少89%,推理时间减少80%。
English: This study introduces a novel framework combining CycleGAN-based domain adaptation with efficient contrastive learning to significantly improve weather classification accuracy in nighttime conditions while reducing computational costs by 89% for training and 80% for inference.
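The pairwise sigmoid loss that the abstract attributes to SigLIP-2 can be sketched as follows. This is a minimal pure-Python illustration of the generic SigLIP-style formulation, not the ClearVision training code: the function names, the temperature `t`, the bias `b`, and the toy embeddings are all assumptions chosen for the example.

```python
import math

def log_sigmoid(v):
    # Numerically stable log(sigmoid(v)) for positive and negative inputs.
    if v >= 0:
        return -math.log1p(math.exp(-v))
    return v - math.log1p(math.exp(v))

def pairwise_sigmoid_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid loss over all image-text pairs in a batch.

    img, txt: lists of equal-length embedding vectors (assumed L2-normalized).
    t, b: illustrative temperature and bias values (assumptions).
    Matching pairs (i == j) are positives; every other pair is a negative.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(p * q for p, q in zip(img[i], txt[j])) + b
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n

# Toy embeddings, for illustration only.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = pairwise_sigmoid_loss(images, aligned_text)
loss_swapped = pairwise_sigmoid_loss(images, swapped_text)
```

Each pair is scored independently, so the loss needs no softmax normalization across the batch; that is the property usually cited as making this objective cheaper than a softmax contrastive loss.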

Authors:Haozhen Yan, Yan Hong, Jiahui Zhan, Yikun Ji, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
Title: COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization
Abstract:
Recent advancements in image manipulation have achieved unprecedented progress in generating photorealistic content while simultaneously eliminating barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries, lacking dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCOInpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.
English Summary: This paper introduces COCOInpaint, a specialized benchmark addressing the gap in detecting inpainting-based image manipulations through 258,266 diverse samples generated by six advanced models, focusing on intrinsic inconsistencies rather than surface artifacts.

Authors:Zheng Qin, Le Wang, Yabing Wang, Sanping Zhou, Gang Hua, Wei Tang
Title: RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation
Abstract:
Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images, then passing them to a policy network. However, challenges remain: (1) Semantic features often fail to provide accurate directional information, leading to superfluous actions, and (2) performance drops significantly when viewpoint inconsistencies arise between training and application. To address these challenges, we propose RSRNav, a simple yet effective method that reasons spatial relationships between the goal and current observations as navigation guidance. Specifically, we model the spatial relationship by constructing correlations between the goal and current observations, which are then passed to the policy network for action prediction. These correlations are progressively refined using fine-grained cross-correlation and direction-aware correlation for more precise navigation. Extensive evaluation of RSRNav on three benchmark datasets demonstrates superior navigation performance, particularly in the "user-matched goal" setting, highlighting its potential for real-world applications.
中文:RSRNav通过建立目标与当前观测之间的空间关联作为导航指引,利用交叉相关和方向感知优化策略,显著提升了图像目标导航的精度和实际应用效果。
English: RSRNav enhances image-goal navigation by modeling spatial relationships between the goal and current observations through fine-grained cross-correlation and direction-aware correlation, achieving superior performance on three benchmark datasets, particularly in the "user-matched goal" setting.

Authors:Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang
Title: From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient pseudo-word token representation capacity, (2) discrepancies between training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework where the training is accomplished from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer and fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data, and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments were conducted on three public datasets, achieving superior performance compared to existing approaches.
中文: 本文提出一种两阶段零样本组合图像检索框架,先通过视觉语义注入增强图像到伪词映射,再利用少量合成数据优化文本组合能力,在三个数据集上实现优于现有方法的性能。
English: This paper introduces a two-stage framework for zero-shot composed image retrieval that first enhances visual-to-pseudo-word mapping and then optimizes text composition, achieving superior performance with minimal synthetic data across three datasets.

Authors:Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu
Title: Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
Abstract:
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.
中文:Turbo2K通过知识蒸馏和分层合成框架,在保持丰富细节的同时实现高效2K视频生成,推理速度提升高达20倍,大幅降低了计算成本。
English: Turbo2K is an efficient framework that enables high-quality 2K video generation through knowledge distillation and a hierarchical synthesis approach, achieving up to 20x faster inference while maintaining visual detail.

Authors:Philip Wiese, Maurus Item, Luca Bertaccini, Yvan Tortorella, Angelo Garofalo, Luca Benini
Title: RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine
Abstract:
As safety-critical applications increasingly rely on data-parallel floating-point computations, there is an increasing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator, balancing fault tolerance, area overhead, and performance impacts. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11x uncorrected fault reduction with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.
中文: RedMulE-FT是一种运行时可配置的容错扩展方案,通过结合复制和检错码保护数据路径,在保持性能的同时以微小面积开销实现显著的故障减少。
English: RedMulE-FT is a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator that combines replication with error-detecting codes, achieving an 11x reduction in uncorrected faults with only 2.3% area overhead while maintaining a 500 MHz frequency.

Authors:Dong Won Lee, Yubin Kim, Denison Guvenoz, Sooyeon Jeong, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
Title: The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning
Abstract:
Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, language models (LMs) and foundational models (FMs) are being utilized as automatic evaluators of human-AI interactions with the goal of eventually being used to improve the policy of the AI agent. To enable further research in this direction, we introduce a large-scale real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of LMs and FMs to identify and reason about social interactions, specifically with regard to robot social errors and competencies. Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions, capturing unique aspects of human-AI interaction only present in real-world interactions. To further assess AI models' ability to reason about social interactions, we propose eight new benchmark tasks centered around whether AI models can (1) evaluate social interactions via detecting social errors and competencies, (2) identify the explanatory factors associated with errors and competencies, (3) understand the flow of real-world social interactions, and (4) provide reasons and corrective actions for social errors. Human studies and experiments with modern LMs and FMs reveal that current models struggle with these tasks, demonstrating that our dataset and benchmark provide a step forward towards socially intelligent AI.
中文摘要:本研究通过引入大规模人机社交互动数据集和八项基准任务,旨在提升人工智能在真实社交场景中的推理能力,实验表明当前语言模型虽具潜力,但在社交互动评估方面仍存在明显不足。
English Summary: This research introduces a large-scale Human-Robot Social Interaction dataset and eight benchmark tasks to advance AI's social reasoning capabilities, revealing that current language models still struggle with real-world social interaction evaluation despite their potential.

Authors:Numan Saeed, Shahad Hardan, Muhammad Ridzuan, Nada Saadi, Karthik Nandakumar, Mohammad Yaqub
Title: Efficient Parameter Adaptation for Multi-Modal Medical Image Segmentation and Prognosis
Abstract:
Cancer detection and prognosis rely heavily on medical imaging, particularly CT and PET scans. Deep Neural Networks (DNNs) have shown promise in tumor segmentation by fusing information from these modalities. However, a critical bottleneck exists: the dependency on CT-PET data concurrently for training and inference, posing a challenge due to the limited availability of PET scans. Hence, there is a clear need for a flexible and efficient framework that can be trained with the widely available CT scans and can still be adapted for PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans such that it can be efficiently adapted for use with PET scans when they become available. This framework is further extended to perform a prognosis task while maintaining the same efficient cross-modal fine-tuning approach. The proposed approach is tested with two well-known segmentation backbones, namely UNETR and Swin UNETR. Our approach offers two main advantages. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) as well as decomposed low-rank adaptation (DoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, by minimizing cross-modal entanglement, PEMMA allows updates using only one modality without causing catastrophic forgetting in the other. Our method achieves comparable performance to early fusion, but with only 8% of the trainable parameters, and demonstrates a significant +28% Dice score improvement on PET scans when trained with a single modality. Furthermore, in prognosis, our method improves the concordance index by +10% when adapting a CT-pretrained model to include PET scans, and by +23% when adapting for both PET and EHR data.
中文摘要:PEMMA框架通过低秩适配技术,实现了仅用CT扫描训练的模型向PET扫描的高效迁移,在保持少量可训练参数的同时,显著提升了肿瘤分割准确率和预后预测性能。
English Summary: The PEMMA framework enables parameter-efficient adaptation of CT-trained segmentation models to incorporate PET scans when available, matching early-fusion performance with only 8% of the trainable parameters and delivering significant improvements in both segmentation and prognosis tasks.
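The low-rank adaptation (LoRA) that the abstract applies to attention weights can be illustrated with a small sketch. It shows the generic LoRA update, where a frozen weight matrix `W` is augmented by a trainable low-rank product `B @ A`; it is not the PEMMA implementation, DoRA is not shown, and the helper names, the scaling factor `alpha`, and the layer sizes are assumptions made for the example.

```python
def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), usually zero-initialized
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    # Returns (parameters in the full matrix, trainable LoRA parameters).
    return d_out * d_in, r * (d_in + d_out)

# Hypothetical 2x2 layer with rank r = 1, for illustration only.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0]]
x = [1.0, 1.0]
y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)   # B = 0: same as W x
y_tuned = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0)  # after a toy update to B
full, trainable = lora_param_counts(768, 768, 8)
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly, so adaptation starts from the pretrained behavior and only the small `A` and `B` matrices need gradients.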

Authors:Niklas Funk, Changqi Chen, Tim Schneider, Georgia Chalvatzaki, Roberto Calandra, Jan Peters
Title: On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting
Abstract:
The field of robotic manipulation has advanced significantly in recent years. At the sensing level, several novel tactile sensors have been developed, capable of providing accurate contact information. On a methodological level, learning from demonstrations has proven an efficient paradigm to obtain performant robotic manipulation policies. The combination of both holds the promise to extract crucial contact-related information from the demonstration data and actively exploit it during policy rollouts. However, despite its potential, it remains an underexplored direction. This work therefore proposes a multimodal, visuotactile imitation learning framework capable of efficiently learning fast and dexterous manipulation policies. We evaluate our framework on the dynamic, contact-rich task of robotic match lighting - a task in which tactile feedback influences human manipulation performance. The experimental results show that adding tactile information into the policies significantly improves performance by over 40%, thereby underlining the importance of tactile sensing for contact-rich manipulation tasks. Project website: https://sites.google.com/view/tactile-il .
中文: 本研究提出了一种多模态视觉触觉模仿学习框架,在接触密集的机器人任务中通过整合触觉信息使性能提升超过40%,证实了触觉感知对灵巧操作的关键作用。
English: This work introduces a multimodal visuotactile imitation learning framework that significantly enhances robotic manipulation performance by over 40% in contact-rich tasks, demonstrating the critical role of tactile sensing.

Authors:Jiasheng Wu, Jingjing Zhang, Zheng Lin, Zhe Chen, Xiong Wang, Wenjun Zhu, Yue Gao
Title: SFL-LEO: Asynchronous Split-Federated Learning Design for LEO Satellite-Ground Network Framework
Abstract:
Recently, the rapid development of LEO satellite networks has spurred another widespread concern: data processing at satellites. However, achieving efficient computation at LEO satellites in highly dynamic satellite networks is challenging and remains an open problem when considering the constrained computation capability of LEO satellites. For the first time, we propose a novel distributed learning framework named SFL-LEO by combining Federated Learning (FL) with Split Learning (SL) to accommodate the high dynamics of LEO satellite networks and the constrained computation capability of LEO satellites by leveraging the periodical orbit traveling feature. The proposed scheme allows training locally by introducing an asynchronous training strategy, i.e., achieving local update when LEO satellites disconnect from the ground station, to provide much more training space and thus increase the training performance. Meanwhile, it aggregates client-side sub-models at the ground station and then distributes them to LEO satellites by borrowing the idea from the federated learning scheme. Experiment results driven by satellite-ground bandwidth measured in Starlink demonstrate that SFL-LEO provides a similar accuracy performance with the conventional SL scheme because it can perform local training even within the disconnection duration.
中文: 我们首次提出名为SFL-LEO的新型分布式学习框架,结合联邦学习与分割学习,利用卫星周期性轨道特征和异步训练策略,有效应对低轨卫星网络的高动态性和计算能力限制。
English: A novel distributed learning framework called SFL-LEO integrates Federated Learning and Split Learning to address the challenges of high dynamics and limited computational capacity in LEO satellite networks by utilizing their periodic orbit patterns and asynchronous training strategies.

Authors:Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma
Title: Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models
Abstract:
This study investigates the relationship between deep learning (DL) model accuracy and expert agreement in classifying crash narratives. We evaluate five DL models -- including BERT variants, USE, and a zero-shot classifier -- against expert labels and narratives, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our findings reveal an inverse relationship: models with higher technical accuracy often show lower agreement with human experts, while LLMs demonstrate stronger expert alignment despite lower accuracy. We use Cohen's Kappa and Principal Component Analysis (PCA) to quantify and visualize model-expert agreement, and employ SHAP analysis to explain misclassifications. Results show that expert-aligned models rely more on contextual and temporal cues than location-specific keywords. These findings suggest that accuracy alone is insufficient for safety-critical NLP tasks. We argue for incorporating expert agreement into model evaluation frameworks and highlight the potential of LLMs as interpretable tools in crash analysis pipelines.
中文摘要:本研究发现深度学习模型在事故叙述分类中的准确性与专家共识呈反向关系,技术精度更高的模型往往与专家判断一致性更低,而大语言模型虽精度较低却展现出更强的专家对齐能力。
English Summary: This study finds an inverse relationship between deep learning model accuracy and expert agreement in crash narrative classification, revealing that higher technical accuracy often corresponds to lower human alignment while large language models demonstrate stronger expert consensus despite reduced accuracy.
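Cohen's Kappa, which the abstract uses to quantify model-expert agreement, corrects raw agreement for the agreement expected by chance. The sketch below is a minimal standard-library implementation of the textbook formula; the function name and the label lists are hypothetical and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical expert and model labels, for illustration only.
expert = ["intersection", "intersection", "work_zone", "work_zone", "other", "other"]
model = ["intersection", "work_zone", "work_zone", "work_zone", "other", "intersection"]
kappa = cohens_kappa(expert, model)
```

In this toy case the raw agreement is 4/6 and the chance agreement is 1/3, giving a kappa of 0.5, which shows how a model can look accurate on raw agreement while its chance-corrected agreement with experts is noticeably lower.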

Authors:Libo Zhang, Yongsheng Yu, Jiali Yao, Heng Fan
Title: High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion
Abstract:
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting that aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate the realistic regions for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and it supports the completion of out-of-domain images effectively.
中文: 提出的MMInvertFill方法通过多模态引导编码器和F&W+潜在空间改进图像修复,有效解决色彩差异和语义不一致问题,在多个数据集上超越现有先进技术。
English: The proposed MMInvertFill method enhances image inpainting through a multimodal guided encoder and F&W+ latent space, effectively addressing color and semantic inconsistencies to outperform existing techniques across diverse datasets.

Authors:Fardad Vakilipoor, Andreas Ettner-Sitter, Lucas Brand, Sebastian Lotter, Thiha Aung, Silke Harteis, Robert Schober, Maximilian Schäfer
Title: The CAM Model: An in vivo Testbed for Molecular Communication Systems
Abstract:
Molecular communication (MC) research increasingly focuses on biomedical applications like health monitoring and drug delivery, demanding testing in realistic living environments. Elevating MC research requires developing advanced in vivo testbeds. We introduce the chorioallantoic membrane (CAM) model as the first versatile 3D in vivo MC platform. The CAM, a highly vascularized membrane in fertilized chicken eggs, is established in bioengineering, cancer research, and drug development. Its biological realism, reproducibility, and versatility make it ideal for next-generation MC testbeds, bridging proof-of-concept systems and practical applications. We comprehensively characterize the CAM model's properties and MC system relevance. Through experimental studies, we investigate fluorescent molecule distribution in the CAM's closed-loop vascular system. We derive an analytical model using the wrapped normal distribution to describe particle propagation in dispersive closed-loop systems dominated by diffusion and flow. Parametric models are developed to approximate particle dynamics in the CAM, with parameters estimated via nonlinear least squares curve fitting. A dataset of 69 regions from 25 eggs validates our models. We analyze parameter relationships and biological plausibility. Finally, we develop a parametric model for long-term particle behavior and liver accumulation in chick embryos.
中文摘要:本研究首次提出绒毛尿囊膜(CAM)模型作为分子通信研究的通用三维体内实验平台,通过建立参数化模型分析粒子在闭环血管系统中的传播特性,并经过大量实验数据验证了其生物可行性。
English Summary: The chorioallantoic membrane (CAM) model is introduced as a pioneering 3D in vivo platform for molecular communication research, enabling experimental studies of particle propagation in biological systems through analytical modeling and empirical validation.
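The wrapped normal distribution named in the abstract is what results when a normal density is wrapped around a closed loop, which suits particles that keep recirculating. The sketch below evaluates the textbook wrapped normal density by truncating its infinite sum; it is not the paper's fitted model, and the function name, the `period` and `terms` parameters, and the example values are assumptions for illustration.

```python
import math

def wrapped_normal_pdf(x, mu, sigma, period=2 * math.pi, terms=50):
    """Density of a Normal(mu, sigma) wrapped onto a loop of length `period`.

    The infinite sum over wraps k is truncated to k in [-terms, terms],
    which is ample when sigma is not much larger than the period.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for k in range(-terms, terms + 1):
        z = (x - mu + k * period) / sigma
        total += math.exp(-0.5 * z * z)
    return norm * total

# Illustrative values only: position 0.3 on the loop, mean 1.0, spread 1.0.
density = wrapped_normal_pdf(0.3, mu=1.0, sigma=1.0)
density_next_lap = wrapped_normal_pdf(0.3 + 2 * math.pi, mu=1.0, sigma=1.0)
```

The density is periodic by construction, so a position and the same position one lap later receive the same value; as `sigma` grows, the wraps overlap and the density flattens toward uniform around the loop.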

Authors:Tao Wen, Jiepeng Wang, Yabo Chen, Shugong Xu, Chi Zhang, Xuelong Li
Title: Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image
Abstract:
Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be represented within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensure strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
中文: Metric-Solver提出了一种基于滑动锚点的方法,通过将深度归一化为近场和远场分量来动态适应不同场景尺度,实现了统一的深度表示,并在多样环境中展现出卓越的泛化能力。
English: Metric-Solver introduces a sliding anchor-based method that dynamically adapts to varying scene scales by normalizing depth into near-field and far-field components, enabling unified depth representation and superior generalization across diverse environments.

Authors:Jiaqi Xue, Xin Xin, Wei Zhang, Mengxin Zheng, Qianqian Song, Minxuan Zhou, Yushun Dong, Dongjie Wang, Xun Chen, Jiafeng Xie, Liqiang Wang, David Mohaisen, Hongyi Wu, Qian Lou
Title: Measuring Computational Universality of Fully Homomorphic Encryption
Abstract:
Many real-world applications, such as machine learning and graph analytics, involve combinations of linear and non-linear operations. As these applications increasingly handle sensitive data, there is a significant demand for privacy-preserving computation techniques capable of efficiently supporting both types of operations, a property we define as "computational universality." Fully Homomorphic Encryption (FHE) has emerged as a powerful approach to perform computations directly on encrypted data. In this paper, we systematically evaluate and measure whether existing FHE methods achieve computational universality or primarily favor either linear or non-linear operations, especially in non-interactive settings. We evaluate FHE universality in three stages. First, we categorize existing FHE methods into ten distinct approaches and analyze their theoretical complexities, selecting the three most promising universal candidates. Next, we perform measurements on representative workloads that combine linear and non-linear operations in various sequences, assessing performance across different bit lengths and with SIMD parallelization enabled or disabled. Finally, we empirically evaluate these candidates on five real-world, privacy-sensitive applications, each involving arithmetic (linear) and comparison-like (non-linear) operations. Our findings indicate significant overheads in current universal FHE solutions, with efficiency strongly influenced by SIMD parallelism, word-wise versus bit-wise operations, and the trade-off between approximate and exact computations. Additionally, our analysis provides practical guidance to help practitioners select the most suitable universal FHE schemes and algorithms based on specific application requirements.
English Summary: This paper systematically evaluates whether existing Fully Homomorphic Encryption (FHE) methods achieve computational universality by efficiently supporting both linear and non-linear operations, finding significant performance overheads while providing practical guidance for scheme selection.

Authors:Elizabeth Fons, Rachneet Kaur, Zhen Zeng, Soham Palande, Tucker Balch, Svitlana Vyetrenko, Manuela Veloso
Title: TADACap: Time-series Adaptive Domain-Aware Captioning
Abstract:
While image captioning has gained significant attention, the potential of captioning time-series images, prevalent in areas like finance and healthcare, remains largely untapped. Existing time-series captioning methods typically offer generic, domain-agnostic descriptions of time-series shapes and struggle to adapt to new domains without substantial retraining. To address these limitations, we introduce TADACap, a retrieval-based framework to generate domain-aware captions for time-series images, capable of adapting to new domains without retraining. Building on TADACap, we propose a novel retrieval strategy that retrieves diverse image-caption pairs from a target domain database, namely TADACap-diverse. We benchmarked TADACap-diverse against state-of-the-art methods and ablation variants. TADACap-diverse demonstrates comparable semantic accuracy while requiring significantly less annotation effort.
中文: TADACap是一种创新的基于检索的框架,可为时间序列图像生成领域感知的描述并无需重新训练即可适应新领域,其增强版TADACap-diverse在保持语义准确性的同时显著降低了标注需求。
English: TADACap is a novel retrieval-based framework that generates domain-aware captions for time-series images and adapts to new domains without retraining, with its enhanced version TADACap-diverse showing competitive semantic accuracy while reducing annotation needs.

Authors:Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Title: Seedream 3.0 Technical Report
Abstract:
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
中文:Seedream 3.0 是一款高性能中英双语图像生成基础模型,通过全流程技术升级解决了复杂提示对齐、精细排版生成等挑战,在保持图像质量的同时实现4-8倍加速,并能原生输出高达2K的高清图像。
English: Seedream 3.0 is an enhanced Chinese-English bilingual image generation model that introduces technical improvements across data construction, training, and deployment, achieving superior alignment with complex prompts, fine-grained typography, and high-resolution outputs up to 2K while accelerating generation speed by 4-8 times.

Authors:Shuheng Hua, Yao Sun, Kairong Ma, Dusit Niyato, Muhammad Ali Imran
Title: A Mathematical Framework of Semantic Communication based on Category Theory
Abstract:
While semantic communication (SemCom) has recently demonstrated great potential to enhance transmission efficiency and reliability by leveraging machine learning (ML) and knowledge base (KB), there is a lack of mathematical modeling to rigorously characterize SemCom system and quantify the performance gain obtained from ML and KB. In this paper, we develop a mathematical framework for SemCom based on category theory, rigorously modeling the concepts of semantic entities and semantic probability space. Within this framework, we introduce the semantic entropy to quantify the uncertainty of semantic entities. We theoretically prove that semantic entropy can be effectively reduced by exploiting KBs, which capture semantic dependencies. Within the formulated semantic space, semantic entities can be combined according to the required semantic ambiguity, and the combined entities can be encoded based on semantic dependencies obtained from KB. Then, we derive semantic channel capacity modeling, which incorporates the mutual information obtained in KB to accurately measure the transmission efficiency of SemCom. Numerical simulations validate the effectiveness of the proposed framework, showing that SemCom with KB integration outperforms traditional communication in both entropy reduction and coding efficiency.
中文: 本文基于范畴论构建了语义通信的数学框架,提出语义熵和语义信道容量的概念,证明通过知识库整合能有效降低语义不确定性,在传输效率上优于传统通信方式。
English: This paper introduces a mathematical framework for semantic communication using category theory, defining semantic entropy and channel capacity to demonstrate that integrating knowledge bases reduces uncertainty and enhances transmission efficiency over traditional methods.

Authors:Max Falkenberg, Matteo Cinelli, Alessandro Galeazzi, Christopher A. Bail, Rosa M Benito, Axel Bruns, Anatoliy Gruzd, David Lazer, Jae K Lee, Jennifer McCoy, Kikuko Nagayoshi, David G Rand, Antonio Scala, Alexandra Siegel, Sander van der Linden, Onur Varol, Ingmar Weber, Magdalena Wojcieszak, Fabiana Zollo, Andrea Baronchelli, Walter Quattrociocchi
Title: Towards global equity in political polarization research
Abstract:
With a folk understanding that political polarization refers to socio-political divisions within a society, many have proclaimed that we are more divided than ever. In this account, polarization has been blamed for populism, the erosion of social cohesion, the loss of trust in the institutions of democracy, legislative dysfunction, and the collective failure to address existential risks such as Covid-19 or climate change. However, at a global scale there is surprisingly little academic literature which conclusively supports these claims, with half of all studies being U.S.-focused. Here, we provide an overview of the global state of research on polarization, highlighting insights that are robust across countries, those unique to specific contexts, and key gaps in the literature. We argue that addressing these gaps is urgent, but has been hindered thus far by systemic and cultural barriers, such as regionally stratified restrictions on data access and misaligned research incentives. If continued cross-disciplinary inertia means that these disparities are left unaddressed, we see a substantial risk that countries will adopt policies to tackle polarization based on inappropriate evidence, risking flawed decision-making and the weakening of democratic institutions.
中文摘要:政治极化虽被普遍归咎于诸多社会问题,但全球学术研究仍显不足且地域分布不均,亟需填补认知空白以避免基于不当证据的政策决策。
English Summary: Political polarization is widely blamed for various societal ills, yet global academic research remains limited and regionally skewed, with urgent need to address knowledge gaps to prevent flawed policy decisions.

Authors:Yibiao Wei, Jie Zou, Weikang Guo, Guoqing Wang, Xing Xu, Yang Yang
Title: MSCRS: Multi-modal Semantic Graph Prompt Learning Framework for Conversational Recommender Systems
Abstract:
Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by interacting with users through conversations. Most existing studies of CRS focus on extracting user preferences from conversational contexts. However, due to the short and sparse nature of conversational contexts, it is difficult to fully capture user preferences by conversational contexts only. We argue that multi-modal semantic information can enrich user preference expressions from diverse dimensions (e.g., a user preference for a certain movie may stem from its magnificent visual effects and compelling storyline). In this paper, we propose a multi-modal semantic graph prompt learning framework for CRS, named MSCRS. First, we extract textual and image features of items mentioned in the conversational contexts. Second, we capture higher-order semantic associations within different semantic modalities (collaborative, textual, and image) by constructing modality-specific graph structures. Finally, we propose an innovative integration of multi-modal semantic graphs with prompt learning, harnessing the power of large language models to comprehensively explore high-dimensional semantic relationships. Experimental results demonstrate that our proposed method significantly improves accuracy in item recommendation, as well as generates more natural and contextually relevant content in response generation.
中文: 本文提出了一种用于对话推荐系统的多模态语义图提示学习框架,通过整合协同、文本和视觉数据来增强用户偏好建模,从而显著提高了推荐准确性并生成更自然的响应内容。
English: This paper introduces a multi-modal semantic graph prompt learning framework for conversational recommender systems, which enhances user preference modeling by integrating collaborative, textual, and visual data, leading to improved recommendation accuracy and more natural response generation.

Authors:Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Title: InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation
Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
中文: 本研究提出了一种新颖的手脸交互动画范式,通过开发区域感知扩散模型和大规模数据集来弥补交互动作研究的空白,以增强生物特征认证系统的安全性和准确性。
English: This research introduces a novel paradigm for animating realistic hand-face interactions, addressing the gap in interactive motion studies by developing a region-aware diffusion model and a large-scale dataset to enhance biometric authentication systems.

Authors:Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
Abstract:
In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple types of video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
中文: OmniVDiff提出了一种统一视频扩散框架,通过动态调整多种视觉模态的角色,实现了文本生成视频、视频理解及属性条件生成等多样化任务。
English: OmniVDiff introduces a unified video diffusion framework that dynamically adapts multiple visual modalities for versatile tasks including text-to-video generation, video understanding, and attribute-conditioned synthesis.

Authors:Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
Title: How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
Abstract:
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
中文摘要:本研究通过梯度谱分析表明,在大型语言模型后训练中,更高质量的指令与推理数据通常对应较低的核范数和较高的有效秩,揭示了数据评估的统一指标,并发现不同模型系列间存在显著梯度模式差异。
English Summary: This study uses spectral analysis of gradients to show that higher-quality instruction and reasoning data in LLM post-training correlate with lower nuclear norms and higher effective ranks, revealing unified metrics for data evaluation and highlighting distinct gradient patterns across model families.
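
The two spectral quantities named in the abstract above can be illustrated with a short sketch. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the paper may use a different definition, and the function name `spectral_stats` is hypothetical.

```python
import numpy as np


def spectral_stats(grad: np.ndarray) -> tuple[float, float]:
    """Return (nuclear_norm, effective_rank) of a 2-D gradient matrix.

    Assumption: effective rank = exp(entropy of normalized singular values).
    """
    # Singular values only; the singular vectors are not needed here.
    s = np.linalg.svd(grad, compute_uv=False)
    nuclear_norm = float(s.sum())
    if nuclear_norm == 0.0:
        return 0.0, 0.0
    # Normalize singular values into a probability distribution.
    p = s / nuclear_norm
    p = p[p > 0]  # drop exact zeros so log() stays finite
    entropy = -np.sum(p * np.log(p))
    return nuclear_norm, float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_grad = rng.normal(size=(64, 32))  # stand-in for one layer's gradient
    print(spectral_stats(layer_grad))
```

Under this definition, a matrix with equal singular values has an effective rank equal to its rank, while a rank-one matrix has an effective rank of about 1.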

Authors:Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Title: Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution
Abstract:
Due to the disparity between real-world degradations in user-generated content (UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.com/Moonsofang/NTIRE-2025-SRlab.
Chinese: 针对真实图像退化与合成数据间的差异,我们提出一种融合语义引导的扩散模型超分辨率方法,通过优化退化模拟和细节生成,在CVPR NTIRE 2025竞赛中获得亚军,显著提升了复原效果。
English: Traditional super-resolution methods struggle with real-world image degradations, so we propose a diffusion-based approach enhanced by semantic guidance and fine-tuned parameters, which outperforms state-of-the-art techniques and secured second place in the CVPR NTIRE 2025 Challenge.

Authors:Seokweon Jung, Hyeon Jeon, Jeongmin Rhee, Jinwook Seo
Title: Can VLMs Assess Similarity Between Graph Visualizations?
Abstract:
Graph visualizations have been studied for tasks such as clustering and temporal analysis, but how these visual similarities relate to established graph similarity measures remains unclear. In this paper, we explore the potential of Vision Language Models (VLMs) to approximate human-like perception of graph similarity. We generate graph datasets of various sizes and densities and compare VLM-derived visual similarity scores with feature-based measures. Our findings indicate VLMs can assess graph similarity in a manner similar to feature-based measures, even though differences among the measures exist. In future work, we plan to extend our research by conducting experiments on human visual graph perception.
中文摘要:本研究探索了视觉语言模型在图形相似性评估中模拟人类视觉感知的能力,发现尽管存在差异,但模型得出的视觉相似度与基于特征的度量方法具有一致性。
English Summary: This study investigates how Vision Language Models (VLMs) can mimic human-like visual perception of graph similarity, finding that VLM-derived scores align with feature-based measures despite some variations.

Authors:Joshua Li, Fernando Jose Pena Cantu, Emily Yu, Alexander Wong, Yuchen Cui, Yuhao Chen
Title: SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
Abstract:
Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics of VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph to a SAM2-generated or SAM2-propagated mask, producing a temporally consistent scene graph in dynamic environments. Finally, we repeat this process for each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
Chinese: SAMJAM是一种零样本方法,结合SAM2的时间跟踪与Gemini的语义理解,在动态厨房环境中生成时间一致的场景图,在EPIC-KITCHENS数据集上比Gemini的平均召回率提高了8.33%。
English: SAMJAM is a zero-shot pipeline that integrates SAM2's temporal tracking with Gemini's semantic understanding to generate temporally consistent scene graphs in dynamic kitchen environments, achieving an 8.33% improvement in mean recall over Gemini on EPIC-KITCHENS datasets.
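
The abstract above does not specify the matching algorithm, so the following is only a sketch of one plausible choice: greedy one-to-one matching by intersection-over-union (IoU) between each scene-graph object's box and the bounding box of each tracked mask. The names `iou` and `match_objects`, the box format `(x1, y1, x2, y2)`, and the 0.5 threshold are all assumptions for illustration.

```python
from typing import Dict, Hashable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_objects(
    object_boxes: Dict[str, Box],
    mask_boxes: Dict[Hashable, Box],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Dict[str, Hashable]:
    """Greedily assign each scene-graph object to at most one tracked mask."""
    candidates = [
        (iou(ob, mb), name, track_id)
        for name, ob in object_boxes.items()
        for track_id, mb in mask_boxes.items()
    ]
    # Visit the best-overlapping pairs first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    assigned: Dict[str, Hashable] = {}
    used = set()
    for score, name, track_id in candidates:
        if score < threshold:
            break  # every remaining pair overlaps even less
        if name in assigned or track_id in used:
            continue
        assigned[name] = track_id
        used.add(track_id)
    return assigned
```

Objects left unassigned would, in a pipeline like the one described, be candidates for a newly generated mask rather than a propagated one.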

Authors:Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang
Title: SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Abstract:
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years, propelled by the advancement in multi-modal LLMs. Although these models have demonstrated proficiency in providing the overall description of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video details inquiries. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel effortless fine-tuning method that employs the rich inherent characteristics of videos for training, while unlocking more fine-grained understanding ability of Video-LLMs. Moreover, it relieves researchers from labor-intensive annotations and smartly circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) A novel benchmark dataset, namely FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
中文摘要:本研究提出的自监督片段微调方法(SF²T)利用视频固有特征增强视频大语言模型的细粒度理解能力,同时通过FineVidBench基准在场景和片段层面提供全面评估。
English Summary: The proposed Self-Supervised Fragment Fine-Tuning (SF²T) method enhances Video-LLMs' fine-grained video understanding by leveraging inherent video characteristics, while the FineVidBench benchmark provides comprehensive evaluation across scene and fragment levels.

Authors:Nicole Tran, Anisa Prasad, Yan Zhuang, Tejas Sudharshan Mathai, Boah Kim, Sydney Lewis, Pritam Mukherjee, Jianfei Liu, Ronald M. Summers
Title: Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI
Abstract:
The segmentation of multiple organs in multi-parametric MRI studies is critical for many applications in radiology, such as correlating imaging biomarkers with disease status (e.g., cirrhosis, diabetes). Recently, three publicly available tools, namely MRSegmentator (MRSeg), TotalSegmentator MRI (TS), and TotalVibeSegmentator (VIBE), have been proposed for multi-organ segmentation in MRI. However, the performance of these tools on specific MRI sequence types has not yet been quantified. In this work, a subset of 40 volumes from the public Duke Liver Dataset was curated. The curated dataset contained 10 volumes each from the pre-contrast fat saturated T1, arterial T1w, venous T1w, and delayed T1w phases, respectively. Ten abdominal structures were manually annotated in these volumes. Next, the performance of the three public tools was benchmarked on this curated dataset. The results indicated that MRSeg obtained a Dice score of 80.7 $\pm$ 18.6 and Hausdorff Distance (HD) error of 8.9 $\pm$ 10.4 mm. It fared the best ($p < .05$) across the different sequence types in contrast to TS and VIBE.
中文: 本研究对三种公开的MRI多器官分割工具进行性能评估,结果表明MRSeg在不同序列类型中均优于TS和VIBE,获得了最高的Dice分数和最低的豪斯多夫距离误差。
English: This study benchmarks three public tools for multi-organ segmentation in MRI, finding that MRSeg outperforms TS and VIBE with the highest Dice score and lowest Hausdorff Distance across various sequence types.
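
For readers unfamiliar with the Dice score reported above, here is a minimal sketch of how it is typically computed for one binary organ mask. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions; the abstract reports Dice on a 0 to 100 scale, which corresponds to multiplying this value by 100.

```python
import numpy as np


def dice_score(pred, truth) -> float:
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


if __name__ == "__main__":
    predicted = [[1, 1, 0], [0, 0, 0]]
    reference = [[1, 0, 0], [0, 0, 0]]
    print(100 * dice_score(predicted, reference))  # Dice as a percentage
```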

Authors:Danilo Cammarata, Matteo Perotti, Marco Bertuletti, Angelo Garofalo, Pasquale Davide Schiavone, David Atienza, Luca Benini
Title: Quadrilatero: A RISC-V programmable matrix coprocessor for low-power edge applications
Abstract:
The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive accesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many researchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V programmable systolic array coprocessor for low-power edge applications that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, showing that it requires only 0.65 mm^2 and that it can reach up to 99.4% of FPU utilization. Compared to a state-of-the-art open-source RISC-V vector processor and a hybrid vector-matrix processor optimized for embedded applications, Quadrilatero improves area efficiency and energy efficiency by up to 77% and 15%, respectively.
中文: 随着AI驱动的物联网应用快速增长,对高性能、低功耗的边缘处理器需求日益增加,为此我们开发了Quadrilatero,一种开源RISC-V脉动阵列协处理器,它通过优化矩阵计算,在面积效率和能效上分别比现有方案提升高达77%和15%。
English: The rapid growth of AI-based IoT applications has driven the need for high-performance edge processors with low power and area constraints, leading to the development of Quadrilatero, an open-source RISC-V systolic array coprocessor that enhances matrix computation efficiency and achieves up to 77% area and 15% energy improvements over existing solutions.

Authors:Tejas Sudharshan Mathai, Benjamin Hou, Ronald M. Summers
Title: Longitudinal Assessment of Lung Lesion Burden in CT
Abstract:
In the U.S., lung cancer is the second major cause of death. Early detection of suspicious lung nodules is crucial for patient treatment planning, management, and improving outcomes. Many approaches for lung nodule segmentation and volumetric analysis have been proposed, but few have looked at longitudinal changes in total lung tumor burden. In this work, we trained two 3D models (nnUNet) with and without anatomical priors to automatically segment lung lesions and quantified total lesion burden for each patient. The 3D model without priors significantly outperformed ($p < .001$) the model trained with anatomy priors. For detecting clinically significant lesions $>$ 1 cm, a precision of 71.3\%, sensitivity of 68.4\%, and F1-score of 69.8\% were achieved. For segmentation, a Dice score of 77.1 $\pm$ 20.3 and Hausdorff distance error of 11.7 $\pm$ 24.1 mm were obtained. The median lesion burden was 6.4 cc (IQR: 2.1, 18.1) and the median volume difference between manual and automated measurements was 0.02 cc (IQR: -2.8, 1.2). Agreements were also evaluated with linear regression and Bland-Altman plots. The proposed approach can produce a personalized evaluation of the total tumor burden for a patient and facilitate interval change tracking over time.
中文: 本研究开发的无解剖先验3D模型在肺部病灶分割中显著优于带先验模型,实现了临床意义的检测指标,可为患者提供个性化肿瘤负荷评估以追踪随时间变化。
English: This study developed a 3D model without anatomical priors that outperformed one with priors in segmenting lung lesions, achieving clinically significant detection metrics and providing personalized tumor burden evaluation for tracking changes over time.
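
As a quick consistency check on the detection metrics quoted above, the F1-score is the harmonic mean of precision and sensitivity (recall). The helper name `f1_score` below is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values from the abstract, in percent.
    print(round(f1_score(71.3, 68.4), 1))
```

With the reported precision of 71.3% and sensitivity of 68.4%, this gives roughly 69.8%, which agrees with the stated F1-score.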

Authors:Anisa V. Prasad, Tejas Sudharshan Mathai, Pritam Mukherjee, Jianfei Liu, Ronald M. Summers
Title: Leveraging Anatomical Priors for Automated Pancreas Segmentation on Abdominal CT
Abstract:
An accurate segmentation of the pancreas on CT is crucial to identify pancreatic pathologies and extract imaging-based biomarkers. However, prior research on pancreas segmentation has primarily focused on modifying the segmentation model architecture or utilizing pre- and post-processing techniques. In this article, we investigate the utility of anatomical priors to enhance the segmentation performance of the pancreas. Two 3D full-resolution nnU-Net models were trained, one with 8 refined labels from the public PANORAMA dataset, and another that combined them with labels derived from the public TotalSegmentator (TS) tool. The addition of anatomical priors resulted in a 6\% increase in Dice score ($p < .001$) and a 36.5 mm decrease in Hausdorff distance for pancreas segmentation ($p < .001$). Moreover, the pancreas was always detected when anatomy priors were used, whereas there were 8 instances of failed detections without their use. The use of anatomy priors shows promise for pancreas segmentation and subsequent derivation of imaging biomarkers.
Chinese: 本研究证明,在CT扫描中引入解剖先验知识可显著提升胰腺分割效果,Dice系数提高6%,豪斯多夫距离缩短36.5毫米,并完全避免了检测失败的情况。
English: This study demonstrates that incorporating anatomical priors significantly improves pancreas segmentation on CT scans, achieving a 6% increase in Dice score and a 36.5 mm reduction in Hausdorff distance while eliminating detection failures.

Authors:Jinbo Peng, Junwen Duan, Zheng Lin, Haoxuan Yuan, Yue Gao, Zhe Chen
Title: SigChord: Sniffing Wide Non-sparse Multiband Signals for Terrestrial and Non-terrestrial Wireless Networks
Abstract:
While unencrypted information inspection in the physical layer (e.g., open headers) can provide deep insights for optimizing wireless networks, the state-of-the-art (SOTA) methods heavily depend on full-sampling-rate (a.k.a. Nyquist rate), high-cost radios, because terrestrial and non-terrestrial networks densely occupy multiple bands across a large bandwidth (e.g., from 4G/5G at 0.4-7 GHz to LEO satellite at 4-40 GHz). To this end, we present SigChord, an efficient physical layer inspection system built on low-cost and sub-Nyquist sampling radios. We first design a deep and rule-based interleaving algorithm based on a Transformer network to perform spectrum sensing and signal recovery under a sub-Nyquist sampling rate, and second, cascade a protocol identifier and decoder based on Transformer neural networks to help analyze physical layer packets. We implement SigChord using software-defined radio platforms, and extensively evaluate it on over-the-air terrestrial and non-terrestrial wireless signals. The experiments demonstrate that SigChord delivers over 99% accuracy in detecting and decoding, while reducing the sampling rate by 34% compared with the SOTA approaches.
中文: SigChord是一种高效的物理层检测系统,它采用低成本、亚奈奎斯特采样率的无线电设备和基于Transformer的算法,在信号检测和解码方面实现了超过99%的准确率,同时相比现有技术将采样率降低了34%。
English: SigChord is an efficient physical layer inspection system that uses low-cost, sub-Nyquist sampling radios and Transformer-based algorithms to achieve over 99% accuracy in signal detection and decoding while reducing the sampling rate by 34% compared to current methods.

Authors:Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou
Title: Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Abstract:
We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), resulting in redundant and ineffective thinking. This newly introduced scenario greatly exacerbates the general overthinking issue; we name it MiP-Overthinking. Such failures run against the ``test-time scaling law'' but are widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning perform much better in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not adequately encourage efficient thinking and leads to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and offer new insights into mitigating the problem.
Chinese: 推理大语言模型在面对缺失前提的不当问题时会产生过长且低效的响应,这暴露了其训练方法的根本缺陷——未能有效培养高效思维能力,导致普遍存在的过度思考现象。
English: Reasoning LLMs produce excessively long and inefficient responses to ill-posed questions with missing premises, revealing a critical flaw in their training that fails to encourage efficient thinking and leads to widespread overthinking.

Authors:Christopher Reinwardt, Robert Balas, Alessandro Ottaviano, Angelo Garofalo, Luca Benini
Title: CVA6-VMRT: A Modular Approach Towards Time-Predictable Virtual Memory in a 64-bit Application Class RISC-V Processor
Abstract:
The increasing complexity of autonomous systems has driven a shift to integrated heterogeneous SoCs with real-time and safety demands. Ensuring deterministic WCETs and low-latency for critical tasks requires minimizing interference on shared resources like virtual memory. Existing techniques, such as software coloring and memory replication, introduce significant area and performance overhead, especially with virtualized memory where address translation adds latency uncertainty. To address these limitations, we propose CVA6-VMRT, an extension of the open-source RISC-V CVA6 core, adding hardware support for predictability in virtual memory access with minimal area overhead. CVA6-VMRT features dynamically partitioned Translation Look-aside Buffers (TLBs) and hybrid L1 cache/scratchpad memory (SPM) functionality. It allows fine-grained per-thread control of resources, enabling the operating system to manage TLB replacements, including static overwrites, to ensure single-cycle address translation for critical memory regions. Additionally, CVA6-VMRT enables runtime partitioning of data and instruction caches into cache and SPM sections, providing low and predictable access times for critical data without impacting other accesses. In a virtualized setting, CVA6-VMRT enhances execution time determinism for critical guests by 94% during interference from non-critical guests, with minimal impact on their average absolute execution time compared to isolated execution of the critical guests only. This interference-aware behaviour is achieved with just a 4% area overhead and no timing penalty compared to the baseline CVA6 core.
中文摘要:本文提出CVA6-VMRT作为RISC-V CVA6核心的硬件扩展,通过分区TLB和混合缓存/便签存储器设计提升虚拟内存访问的可预测性,在仅增加4%面积开销下使关键任务的执行确定性提升94%。
English Summary: The paper introduces CVA6-VMRT, a hardware extension for the RISC-V CVA6 core that enhances virtual memory predictability through partitioned TLBs and hybrid cache/scratchpad functionality, achieving 94% better determinism for critical tasks with only 4% area overhead.

Authors:Peter D. Erickson, Tejas Sudharshan Mathai, Ronald M. Summers
Title: Class Imbalance Correction for Improved Universal Lesion Detection and Tagging in CT
Abstract:
Radiologists routinely detect and size lesions in CT to stage cancer and assess tumor burden. To potentially aid their efforts, multiple lesion detection algorithms have been developed with a large public dataset called DeepLesion (32,735 lesions, 32,120 CT slices, 10,594 studies, 4,427 patients, 8 body part labels). However, this dataset contains missing measurements and lesion tags, and exhibits a severe imbalance in the number of lesions per label category. In this work, we utilize a limited subset of DeepLesion (6\%, 1331 lesions, 1309 slices) containing lesion annotations and body part label tags to train a VFNet model to detect lesions and tag them. We address the class imbalance by conducting three experiments: 1) Balancing data by the body part labels, 2) Balancing data by the number of lesions per patient, and 3) Balancing data by the lesion size. In contrast to a randomly sampled (unbalanced) data subset, our results indicated that balancing the body part labels always increased sensitivity for lesions >= 1cm for classes with low data quantities (Bone: 80\% vs. 46\%, Kidney: 77\% vs. 61\%, Soft Tissue: 70\% vs. 60\%, Pelvis: 83\% vs. 76\%). Similar trends were seen for three other models tested (FasterRCNN, RetinaNet, FoveaBox). Balancing data by lesion size also helped the VFNet model improve recalls for all classes in contrast to an unbalanced dataset. We also provide a structured reporting guideline for a ``Lesions'' subsection to be entered into the ``Findings'' section of a radiology report. To our knowledge, we are the first to report the class imbalance in DeepLesion, and have taken data-driven steps to address it in the context of joint lesion detection and tagging.
Chinese: 研究者利用DeepLesion数据集的子集训练VFNet模型进行病灶检测与标记,通过数据平衡方法解决了类别不平衡问题,显著提升了小样本类别病灶的检测灵敏度并改善了整体召回率。
English: Researchers used a 6% subset of the DeepLesion dataset to train a VFNet model for lesion detection and tagging, addressing class imbalance through data balancing; balancing by body part label improved sensitivity for under-represented lesion classes, and balancing by lesion size improved recall for all classes.

Authors:Alexander Shieh, Tejas Sudharshan Mathai, Jianfei Liu, Angshuman Paul, Ronald M. Summers
Title: Correcting Class Imbalances with Self-Training for Improved Universal Lesion Detection and Tagging
Abstract:
Universal lesion detection and tagging (ULDT) in CT studies is critical for tumor burden assessment and tracking the progression of lesion status (growth/shrinkage) over time. However, a lack of fully annotated data hinders the development of effective ULDT approaches. Prior work used the DeepLesion dataset (4,427 patients, 10,594 studies, 32,120 CT slices, 32,735 lesions, 8 body part labels) for algorithmic development, but this dataset is not completely annotated and contains class imbalances. To address these issues, in this work, we developed a self-training pipeline for ULDT. A VFNet model was trained on a limited 11.5\% subset of DeepLesion (bounding boxes + tags) to detect and classify lesions in CT studies. Then, it identified and incorporated novel lesion candidates from a larger unseen data subset into its training set, and was self-trained over multiple rounds. Multiple self-training experiments were conducted with different threshold policies to select predicted lesions with higher quality and cover the class imbalances. We discovered that direct self-training improved the sensitivities of over-represented lesion classes at the expense of under-represented classes. However, upsampling the lesions mined during self-training along with a variable threshold policy yielded a 6.5\% increase in sensitivity at 4 FP in contrast to self-training without class balancing (72\% vs 78.5\%) and an 11.7\% increase compared to the same self-training policy without upsampling (66.8\% vs 78.5\%). Furthermore, we show that our results either improved or maintained the sensitivity at 4 FP for all 8 lesion classes.
中文: 本研究提出了一种基于VFNet的自训练流程,用于改进CT扫描中的通用病灶检测与标注,通过结合上采样和可变阈值策略,在4个假阳性下提升或保持了全部8个病灶类别的检测灵敏度,并缓解了数据集的类别不平衡问题。
English: The study introduces a self-training pipeline using VFNet for universal lesion detection and tagging in CT scans; combined with upsampling and a variable threshold policy, it improves or maintains sensitivity at 4 FP for all 8 lesion classes while addressing dataset class imbalance.

Authors:Jared Frazier, Tejas Sudharshan Mathai, Jianfei Liu, Angshuman Paul, Ronald M. Summers
Title: 3D Universal Lesion Detection and Tagging in CT with Self-Training
Abstract:
Radiologists routinely perform the tedious task of lesion localization, classification, and size measurement in computed tomography (CT) studies. Universal lesion detection and tagging (ULDT) can simultaneously help alleviate the cumbersome nature of lesion measurement and enable tumor burden assessment. Previous ULDT approaches utilize the publicly available DeepLesion dataset; however, it does not provide the full volumetric (3D) extent of lesions and also displays a severe class imbalance. In this work, we propose a self-training pipeline to detect 3D lesions and tag them according to the body part they occur in. We used a significantly limited 30\% subset of DeepLesion to train a VFNet model for 2D lesion detection and tagging. Next, the 2D lesion context was expanded into 3D, and the mined 3D lesion proposals were integrated back into the baseline training data in order to retrain the model over multiple rounds. Through the self-training procedure, our VFNet model learned from its own predictions, detected lesions in 3D, and tagged them. Our results indicated that our VFNet model achieved an average sensitivity of 46.9\% at [0.125:8] false positives (FP) with a limited 30\% data subset in comparison to the 46.8\% of an existing approach that used the entire DeepLesion dataset. To our knowledge, we are the first to jointly detect lesions in 3D and tag them according to the body part label.
中文: 本研究提出一种自训练流程,仅使用30%的DeepLesion数据集即可实现三维病灶检测和部位标注,在克服体积数据不足和类别不平衡问题的同时,取得了与现有方法相当的检测灵敏度。
English: This study introduces a self-training pipeline that enables 3D lesion detection and anatomical tagging using only 30% of the DeepLesion dataset, achieving comparable sensitivity to existing methods while overcoming volumetric data limitations and class imbalance.

Authors:Tejas Sudharshan Mathai, Sungwon Lee, Thomas C. Shen, Zhiyong Lu, Ronald M. Summers
Title: Universal Lymph Node Detection in Multiparametric MRI with Selective Augmentation
Abstract:
Robust localization of lymph nodes (LNs) in multiparametric MRI (mpMRI) is critical for the assessment of lymphadenopathy. Radiologists routinely measure the size of LNs to distinguish benign from malignant nodes, which would require subsequent cancer staging. Sizing is a cumbersome task compounded by the diverse appearances of LNs in mpMRI, which renders their measurement difficult. Furthermore, smaller and potentially metastatic LNs could be missed during a busy clinical day. To alleviate these imaging and workflow problems, we propose a pipeline to universally detect both benign and metastatic nodes in the body for their ensuing measurement. The recently proposed VFNet neural network was employed to identify LNs in T2 fat suppressed and diffusion weighted imaging (DWI) sequences acquired by various scanners with a variety of exam protocols. We also use a selective augmentation technique known as Intra-Label LISA (ILL) to diversify the input data samples the model sees during training, such that it improves its robustness during the evaluation phase. We achieved a sensitivity of $\sim$83\% with ILL vs. $\sim$80\% without ILL at 4 FP/vol. Compared with current LN detection approaches evaluated on mpMRI, we show a sensitivity improvement of $\sim$9\% at 4 FP/vol.
中文摘要:本研究提出了一种结合VFNet神经网络和标签内LISA增强技术的流程,用于在全身多参数MRI中普遍检测良性和转移性淋巴结,相比现有方法在每体积4个假阳性时灵敏度提高了约9%。
English Summary: The study introduces a pipeline using the VFNet neural network and Intra-Label LISA augmentation to universally detect lymph nodes in multiparametric MRI, improving sensitivity by approximately 9% at 4 false positives per volume compared to existing methods.

Authors:Yu Min Park, Yan Kyaw Tun, Walid Saad, Choong Seon Hong
Title: Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework
Abstract:
Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. However, conventional channel estimation methods, such as pilot signals or beam sweeping, often fail to adapt to rapidly changing communication environments. To address this limitation, multimodal sensing-aided beam prediction has gained significant attention, using various sensing data from devices such as LiDAR, radar, GPS, and RGB images to predict user locations or network conditions. Despite its promising potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high costs, and limited datasets. Thus, in this paper, a resource-efficient learning approach is proposed to transfer knowledge from a multimodal network to a monomodal (radar-only) network based on cross-modal relational knowledge distillation (CRKD), while reducing computational overhead and preserving predictive accuracy. To enable multimodal learning with realistic data, a novel multimodal simulation framework is developed while integrating sensor data generated from the autonomous driving simulator CARLA with MATLAB-based mmWave channel modeling, and reflecting real-world conditions. The proposed CRKD achieves its objective by distilling relational information across different feature spaces, which enhances beam prediction performance without relying on expensive sensor data. Simulation results demonstrate that CRKD efficiently distills multimodal knowledge, allowing a radar-only model to achieve $94.62\%$ of the teacher performance. In particular, this is achieved with just $10\%$ of the teacher network's parameters, thereby significantly reducing computational complexity and dependence on multimodal sensor data.
中文: 本文提出了一种基于跨模态关系知识蒸馏(CRKD)的资源高效学习方法,将多模态感知知识迁移至仅使用雷达的单模态网络,在显著降低计算复杂度和数据依赖的同时保持了接近教师网络的性能。
English: This paper introduces a resource-efficient learning method using cross-modal relational knowledge distillation (CRKD) to transfer multimodal sensing knowledge to a radar-only network, achieving near-teacher performance with significantly reduced computational complexity and data dependency.
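
The idea of distilling relational information across different feature spaces can be illustrated with a generic distance-wise relational loss. The sketch below is an assumption-laden illustration, not the paper's CRKD objective: the function names, the mean-distance normalization, and the squared-error penalty are all hypothetical choices. It compares the pairwise-distance structure of a teacher batch and a student batch, so the two embedding spaces may have different dimensions.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def normalize_by_mean(dist_matrix):
    """Divide by the mean off-diagonal distance so feature scale cancels out.

    Assumes at least two embeddings that are not all identical.
    """
    n = len(dist_matrix)
    off_diag = [dist_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in dist_matrix]

def relational_loss(teacher_emb, student_emb):
    """Mean squared gap between normalized teacher and student distance matrices."""
    t = normalize_by_mean(pairwise_distances(teacher_emb))
    s = normalize_by_mean(pairwise_distances(student_emb))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Made-up embeddings: a 3-D "teacher" batch and two 2-D "student" batches.
teacher = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_same_shape = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same geometry, smaller scale
student_distorted = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]   # different geometry

loss_same = relational_loss(teacher, student_same_shape)
loss_distorted = relational_loss(teacher, student_distorted)
```

Because only relative distances are compared, a student that preserves the teacher's geometry at a different scale or dimension incurs (numerically) zero loss, while a student that distorts the geometry is penalized.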

Authors:Jinxiang Lai, Wenlong Wu, Jiawei Zhan, Jian Li, Bin-Bin Gao, Jun Liu, Jie Zhang, Song Guo
Title: BoxSeg: Quality-Aware and Peer-Assisted Learning for Box-supervised Instance Segmentation
Abstract:
Box-supervised instance segmentation methods aim to achieve instance segmentation with only box annotations. Recent methods have demonstrated the effectiveness of acquiring high-quality pseudo masks under the teacher-student framework. Building upon this foundation, we propose a BoxSeg framework involving two novel and general modules named the Quality-Aware Module (QAM) and the Peer-assisted Copy-paste (PC). The QAM obtains high-quality pseudo masks and better measures the mask quality to help reduce the effect of noisy masks, by leveraging the quality-aware multi-mask complementation mechanism. The PC imitates Peer-Assisted Learning to further improve the quality of the low-quality masks with the guidance of the obtained high-quality pseudo masks. Theoretical and experimental analyses demonstrate the proposed QAM and PC are effective. Extensive experimental results show the superiority of our BoxSeg over the state-of-the-art methods, and illustrate the QAM and PC can be applied to improve other models.
中文:BoxSeg框架通过质量感知模块和同伴辅助复制粘贴技术,从框标注中生成高质量伪掩码,有效降低噪声影响,并在性能上超越现有先进方法。
English: The BoxSeg framework introduces a Quality-Aware Module and Peer-assisted Copy-paste to generate high-quality pseudo masks from box annotations, effectively reducing noise and outperforming current methods.

Authors:Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
Title: Towards Visual Text Grounding of Multimodal Large Language Model
Abstract:
Despite the ongoing evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. Finetuning MLLMs on our synthetic dataset yields promising improvements in their spatial reasoning and grounding capabilities.
中文: 多模态大语言模型在处理富含文本的文档图像时存在视觉文本定位困难,为此提出了TRIG任务和数据集,通过合成数据和新方法对其能力进行基准测试和提升。
English: Multimodal Large Language Models face challenges in visual text grounding for text-rich document images, leading to the introduction of the TRIG task and dataset to benchmark and enhance their capabilities through synthetic data and novel methods.

Authors:Jiayun Li, Kay Pompetzki, An Thai Le, Haolei Tong, Jan Peters, Georgia Chalvatzaki
Title: Constrained Gaussian Process Motion Planning via Stein Variational Newton Inference
Abstract:
Gaussian Process Motion Planning (GPMP) is a widely used framework for generating smooth trajectories within a limited compute time--an essential requirement in many robotic applications. However, traditional GPMP approaches often struggle with enforcing hard nonlinear constraints and rely on Maximum a Posteriori (MAP) solutions that disregard the full Bayesian posterior. This limits planning diversity and ultimately hampers decision-making. Recent efforts to integrate Stein Variational Gradient Descent (SVGD) into motion planning have shown promise in handling complex constraints. Nonetheless, these methods still face persistent challenges, such as difficulties in strictly enforcing constraints and inefficiencies when the probabilistic inference problem is poorly conditioned. To address these issues, we propose a novel constrained Stein Variational Gaussian Process Motion Planning (cSGPMP) framework, incorporating a GPMP prior specifically designed for trajectory optimization under hard constraints. Our approach improves the efficiency of particle-based inference while explicitly handling nonlinear constraints. This advancement significantly broadens the applicability of GPMP to motion planning scenarios demanding robust Bayesian inference, strict constraint adherence, and computational efficiency within a limited time. We validate our method on standard benchmarks, achieving an average success rate of 98.57% across 350 planning tasks, significantly outperforming competitive baselines. This demonstrates the ability of our method to discover and use diverse trajectory modes, enhancing flexibility and adaptability in complex environments, and delivering significant improvements over standard baselines without incurring major computational costs.
中文: 提出的约束Stein变分高斯过程运动规划(cSGPMP)框架有效处理硬非线性约束并提升基于粒子的推理效率,在基准测试中达到98.57%成功率,显著优于基线方法且未增加计算负担。
English: The proposed constrained Stein Variational Gaussian Process Motion Planning (cSGPMP) framework effectively handles hard nonlinear constraints while improving particle-based inference efficiency, achieving a 98.57% success rate in benchmarks and outperforming baselines without major computational costs.

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: Enhancing Traffic Incident Response through Sub-Second Temporal Localization with HybridMamba
Abstract:
Traffic crash detection in long-form surveillance videos is essential for improving emergency response and infrastructure planning, yet remains difficult due to the brief and infrequent nature of crash events. We present \textbf{HybridMamba}, a novel architecture integrating visual transformers with state-space temporal modeling to achieve high-precision crash time localization. Our approach introduces multi-level token compression and hierarchical temporal processing to maintain computational efficiency without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of \textbf{1.50 seconds} for 2-minute videos ($p<0.01$ compared to baselines), with \textbf{65.2%} of predictions falling within one second of the ground truth. It outperforms recent video-language models (e.g., TimeChat, VideoLLaMA-2) by up to 3.95 seconds while using significantly fewer parameters (3B vs. 13--72B). Our results demonstrate effective temporal localization across various video durations (2--40 minutes) and diverse environmental conditions, highlighting HybridMamba's potential for fine-grained temporal localization in traffic surveillance while identifying challenges that remain for extended deployment.
中文:HybridMamba通过结合视觉变换器和状态空间时序建模的新架构,以显著更少的参数量实现了1.50秒平均绝对误差的高精度事故时间定位。
English: HybridMamba, a novel architecture combining visual transformers with state-space temporal modeling, achieves high-precision crash time localization with a mean absolute error of 1.50 seconds while using significantly fewer parameters than existing models.
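
The two headline numbers in this abstract are a mean absolute error and the share of predictions falling within one second of the ground truth. A minimal sketch of how such metrics are typically computed follows; the function name and the timestamps are made up for illustration and are not taken from the paper or its dataset.

```python
def localization_metrics(predicted, ground_truth, tolerance=1.0):
    """Return (mean absolute error, fraction of predictions within tolerance).

    Both inputs are sequences of event times in seconds, one per video.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mae, within

# Hypothetical predicted and true crash times (seconds) for four videos.
pred = [10.0, 31.0, 58.5, 90.0]
truth = [10.5, 30.0, 61.5, 90.5]
mae, within = localization_metrics(pred, truth)
```

With these made-up values the absolute errors are 0.5, 1.0, 3.0, and 0.5 seconds, so a single large miss raises the mean error while three of the four predictions still count as within tolerance.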

Authors:Xiaolong Sun, Le Wang, Sanping Zhou, Liushuai Shi, Kun Xia, Mengnan Liu, Yabing Wang, Gang Hua
Title: Moment Quantization for Video Temporal Grounding
Abstract:
Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
Chinese: 提出的MQVTG方法通过可学习的时刻码本将视频时刻量化为离散向量,有效区分相关与无关时刻,在多个基准测试中显著优于现有方法。
English: The proposed MQVTG method enhances video temporal grounding by quantizing video moments into discrete vectors through a learnable codebook, effectively distinguishing relevant from irrelevant moments and outperforming existing methods across multiple benchmarks.
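
The abstract contrasts direct hard quantization with treating moment-codeword matching as a clustering process. The sketch below illustrates that general contrast only; it assumes a softmax over negative squared distances as the soft assignment, which is a common choice and not necessarily the one MQVTG uses. All names and values are hypothetical.

```python
import math

def squared_distances(feature, codebook):
    """Squared Euclidean distance from one feature to every codeword."""
    return [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]

def hard_assign(feature, codebook):
    """Hard quantization: index of the single nearest codeword."""
    dists = squared_distances(feature, codebook)
    return min(range(len(codebook)), key=dists.__getitem__)

def soft_assign(feature, codebook, temperature=1.0):
    """Soft assignment: softmax weights over codewords and their weighted mix."""
    logits = [-d / temperature for d in squared_distances(feature, codebook)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * code[i] for w, code in zip(weights, codebook))
             for i in range(len(feature))]
    return weights, mixed

# Made-up 2-D codebook and moment feature.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]

nearest = hard_assign(feature, codebook)
weights, mixed = soft_assign(feature, codebook, temperature=0.5)
```

Hard assignment replaces the feature with one codeword and discards everything else, whereas the soft weights keep graded information about how close the feature is to each codeword.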

Authors:Noam Elata, Hyungjin Chung, Jong Chul Ye, Tomer Michaeli, Michael Elad
Title: InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems
Abstract:
Diffusion Models have demonstrated remarkable capabilities in handling inverse problems, offering high-quality posterior-sampling-based solutions. Despite significant advances, a fundamental trade-off persists, regarding the way the conditioned synthesis is employed: Training-based methods achieve high quality results, while zero-shot approaches trade this with flexibility. This work introduces a framework that combines the best of both worlds -- the strong performance of supervised approaches and the flexibility of zero-shot methods. This is achieved through a novel architectural design that seamlessly integrates the degradation operator directly into the denoiser. In each block, our proposed architecture applies the degradation operator on the network activations and conditions the output using the attention mechanism, enabling adaptation to diverse degradation scenarios while maintaining high performance. Our work demonstrates the versatility of the proposed architecture, operating as a general MMSE estimator, a posterior sampler, or a Neural Posterior Principal Component estimator. This flexibility enables a wide range of downstream tasks, highlighting the broad applicability of our framework. The proposed modification of the denoiser network offers a versatile, accurate, and computationally efficient solution, demonstrating the advantages of dedicated network architectures for complex inverse problems. Experimental results on the FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling performance, surpassing both training-based and zero-shot alternatives.
中文摘要:本研究提出了一种新颖的扩散模型框架,通过将退化算子直接集成到去噪器架构中,在解决逆问题时同时实现了监督方法的高性能和零样本方法的灵活性。
English Summary: This work introduces a novel diffusion model framework that integrates degradation operators into the denoiser architecture, achieving both the high performance of supervised methods and the flexibility of zero-shot approaches for solving inverse problems.

Authors:Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan
Title: GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Abstract:
Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
Chinese: GeometryCrafter是一种创新框架,通过点图变分自编码器和视频扩散模型,解决了现有视频深度估计方法几何保真度的不足,能生成高保真、时序一致的点图序列,实现精确的3D/4D重建和深度应用。
English: GeometryCrafter is a novel framework that overcomes the geometric fidelity limitations of existing video depth estimation methods by using a point map VAE and video diffusion model to produce high-fidelity, temporally coherent point map sequences for accurate 3D/4D reconstruction and depth-based applications.

Authors:Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun
Title: InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation
Abstract:
Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.
中文摘要:InformGen是一款基于大语言模型的辅助工具,通过优化文档解析和内容生成并结合人工干预,显著提高了知情同意书起草的合规性和事实准确性,其表现远超基础GPT-4o模型。
English Summary: InformGen is an LLM-powered tool that enhances the drafting of informed consent forms by ensuring high regulatory compliance and factual accuracy through optimized document processing and human oversight, significantly outperforming standard GPT-4o models.

Authors:Yilin Qi, Dong Won Lee, Cynthia Breazeal, Hae Won Park
Title: Does "Reasoning" with Large Language Models Improve Recognizing, Generating, and Reframing Unhelpful Thoughts?
Abstract:
Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation, and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies such as CoT and self-consistency in enhancing LLMs' ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to "outdated" LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models on recognizing, generating and reframing unhelpful thoughts.
Chinese: 应用于大型语言模型的增强推理方法(如思维链和自洽性)即使用于GPT-3.5等较旧的模型,在识别、生成和重构无益想法方面也始终优于最先进的预训练推理模型。
English: Augmented reasoning methods such as CoT and self-consistency, even when applied to older LLMs like GPT-3.5, consistently outperform state-of-the-art pre-trained reasoning models at recognizing, generating, and reframing unhelpful thoughts.

Authors:Zihan Chen, Xingbo Fu, Yushun Dong, Jundong Li, Cong Shen
Title: FedHERO: A Federated Learning Approach for Node Classification Task on Heterophilic Graphs
Abstract:
Federated Graph Learning (FGL) empowers clients to collaboratively train graph neural networks (GNNs) in a distributed manner while preserving data privacy. However, FGL methods usually require that the graph data owned by all clients is homophilic to ensure similar neighbor distribution patterns of nodes. Such an assumption ensures that the learned knowledge is consistent across the local models from all clients. Therefore, these local models can be properly aggregated as a global model without undermining the overall performance. Nevertheless, when the neighbor distribution patterns of nodes vary across different clients (e.g., when clients hold graphs with different levels of heterophily), their local models may gain different and even conflicting knowledge from their node-level predictive tasks. Consequently, aggregating these local models usually leads to catastrophic performance deterioration on the global model. To address this challenge, we propose FedHERO, an FGL framework designed to harness and share insights from heterophilic graphs effectively. At the heart of FedHERO is a dual-channel GNN equipped with a structure learner, engineered to discern the structural knowledge encoded in the local graphs. With this specialized component, FedHERO enables the local model for each client to identify and learn patterns that are universally applicable across graphs with different patterns of node neighbor distributions. FedHERO not only enhances the performance of individual client models by leveraging both local and shared structural insights but also sets a new precedent in this field for effectively handling graph data with various node neighbor distribution patterns. We conduct extensive experiments to validate the superior performance of FedHERO against existing alternatives.
中文: 联邦图学习(FGL)允许客户端在保护数据隐私的同时协作训练图神经网络,但当客户端持有具有不同节点邻居分布(如异质性)的图时,性能会显著下降,而FedHERO通过双通道图神经网络和结构学习器有效共享跨不同图模式的见解来解决这一问题。
English: Federated Graph Learning (FGL) enables collaborative GNN training while preserving data privacy, but struggles with performance degradation when clients have graphs with varying node neighbor distributions, such as heterophily, which FedHERO addresses by using a dual-channel GNN and structure learner to effectively share insights across diverse graph patterns.

Authors:Enes Özeren, Yihong Liu, Hinrich Schütze
Title: HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Abstract:
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM's token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms the random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
中文:HYPEROFA采用基于超网络的方法,比OFA受限的初始化方式更自适应地生成低资源语言的词嵌入,在持续预训练收敛性和下游任务性能上均优于或匹配现有方法。
English: HYPEROFA introduces a hypernetwork-based approach to initialize token embeddings for low-resource languages more adaptively than OFA's constrained method, achieving superior or comparable performance in continual pre-training and downstream tasks.
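
To make the constraint mentioned in the abstract concrete, the sketch below initializes one target-token embedding as a convex combination of a fixed number of source-token embeddings. It is only an illustration: the abstract does not give OFA's exact weighting rule, so the top-`k` selection, the cosine similarity, the softmax `temperature`, and all function and variable names here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def convex_init(target_vec, source_vecs, source_embs, k=2, temperature=0.1):
    """Initialize a target embedding as a convex combination of k source embeddings.

    target_vec / source_vecs live in an external word-vector space;
    source_embs are the matching rows of the model's embedding matrix.
    The weighting scheme (softmax over top-k cosine similarities) is assumed.
    """
    sims = [cosine(target_vec, s) for s in source_vecs]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Numerically stable softmax: weights are non-negative and sum to 1.
    max_sim = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - max_sim) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(source_embs[0])
    emb = [sum(w * source_embs[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return emb, dict(zip(top, weights))

# Toy data: three source tokens with 2-d word vectors and 3-d model embeddings.
source_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
source_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
emb, weights = convex_init([0.9, 0.1], source_vecs, source_embs)
```

Because the weights are non-negative and sum to one, the result always lies inside the convex hull of the chosen source embeddings, which is the expressiveness limit that a learned mapping such as a hypernetwork is meant to lift.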

Authors:Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Title: The Leaderboard Illusion
Abstract:
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
中文: 该研究揭示Chatbot Arena排行榜因选择性披露性能数据和资源分配不均而产生扭曲,偏袒特定供应商,损害了AI模型评估的公正性。
English: The study reveals that Chatbot Arena's leaderboard is skewed by selective disclosure of performance results and unequal data distribution, favoring certain providers and compromising the fairness of AI model evaluations.
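
The selection effect described in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's analysis: every variant is given the same true skill, scores are drawn with Gaussian noise, and the numbers `1200.0` and `15.0` are hypothetical. Only the count of 27 variants comes from the abstract.

```python
import random
import statistics

def best_of_n(true_skill, noise_sd, n_variants, rng):
    """Report only the highest noisy score among n equally skilled variants."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))

def mean_reported(n_variants, trials=2000, true_skill=1200.0, noise_sd=15.0, seed=0):
    """Average reported score over many simulated releases."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n(true_skill, noise_sd, n_variants, rng) for _ in range(trials)
    )

single = mean_reported(1)   # one public submission, no selection
many = mean_reported(27)    # best of 27 privately tested variants

print(f"mean reported score, 1 variant:   {single:.1f}")
print(f"mean reported score, 27 variants: {many:.1f}")
```

Even though no variant is actually better than another in this setup, publishing only the maximum shifts the reported score upward, which is the bias from selective disclosure that the abstract refers to.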

Authors:Jiuyu Liu, Yi Ma, Rahim Tafazolli
Title: SA-MIMO: Scalable Quantum-Based Wireless Communications
Abstract:
Rydberg atomic receivers offer a quantum-native alternative to conventional RF front-ends by directly detecting electromagnetic fields via highly excited atomic states. While their quantum-limited sensitivity and hardware simplicity make them promising for future wireless systems, extending their use to scalable multi-antenna and multi-carrier configurations, termed Scalable Atomic-MIMO (SA-MIMO), remains largely unexplored. This paper introduces a novel RF transmitter-atomic receiver architecture that addresses this gap. The core idea lies in a novel modulation technique called Phase-Rotated Symbol Spreading (PRSS), which transforms the nonlinear phase retrieval problem inherent to atomic detection into a tractable linear demultiplexing task. PRSS enables efficient signal processing and supports scalable MUX/DeMUX operations in both atomic MIMO and atomic OFDM systems. Simulation results show that the proposed system achieves up to 2.5 dB gain under optimal maximum-likelihood detection and over 10 dB under suboptimal detection in MIMO settings. These results establish PRSS assisted SA-MIMO as a promising architecture for realizing high-sensitivity, interference-resilient atomic wireless communication.
中文摘要:本文提出了一种新型相位旋转符号扩展技术,通过将非线性相位检测转化为线性解复用,实现了可扩展的原子MIMO和OFDM系统,在原子无线通信中获得了显著的性能提升。
English Summary: This paper introduces a novel Phase-Rotated Symbol Spreading (PRSS) technique that enables scalable atomic MIMO and OFDM systems by converting nonlinear phase detection into linear demultiplexing, achieving significant performance gains in atomic wireless communication.

Authors:Shuang Tian, Tao Zhang, Jiqiang Liu, Jiacheng Wang, Xuangou Wu, Xiaoqiang Zhu, Ruichen Zhang, Weiting Zhang, Zhenhui Yuan, Shiwen Mao, Dong In Kim
Title: Exploring the Role of Large Language Models in Cybersecurity: A Systematic Survey
Abstract:
With the rapid development of technology and the acceleration of digitalisation, the frequency and complexity of cybersecurity threats are increasing. Traditional cybersecurity approaches, often based on static rules and predefined scenarios, are struggling to adapt to the rapidly evolving nature of modern cyberattacks. There is an urgent need for more adaptive and intelligent defence strategies. The emergence of Large Language Models (LLMs) provides an innovative way to cope with increasingly severe cyber threats. Their potential in analysing complex attack patterns, predicting threats, and assisting real-time response has attracted considerable attention in the field of cybersecurity, and how to use LLMs effectively to defend against cyberattacks has become a hot research topic. This survey examines the applications of LLMs from the perspective of the cyberattack lifecycle, focusing on the three phases of defense reconnaissance, foothold establishment, and lateral movement, and it analyzes the potential of LLMs in Cyber Threat Intelligence (CTI) tasks. We also investigate how LLM-based security solutions are deployed and applied in different network scenarios, and summarize the internal and external risks that LLMs face in these applications. Finally, the survey points out open risk issues and possible future research directions in this domain.
中文摘要:本综述探讨了大语言模型在网络安全领域的应用,分析了其在攻击生命周期各阶段、不同网络场景部署及网络威胁情报任务中的潜力,同时指出了相关风险问题和未来研究方向。
English Summary: This survey explores the application of Large Language Models (LLMs) in cybersecurity, analyzing their potential across various attack phases, deployment scenarios, and Cyber Threat Intelligence tasks, while also addressing associated risks and future research directions.

Authors:Lander Besabe, Michele Girfoglio, Annalisa Quaini, Gianluigi Rozza
Title: Randomized Proper Orthogonal Decomposition for data-driven reduced order modeling of a two-layer quasi-geostrophic ocean model
Abstract:
The two-layer quasi-geostrophic equations (2QGE) serve as a simplified model for simulating wind-driven, stratified ocean flows. However, their numerical simulation remains computationally expensive due to the need for high-resolution meshes to capture a wide range of turbulent scales. This becomes especially problematic when several simulations need to be run because of, e.g., uncertainty in the parameter settings. To address this challenge, we propose a data-driven reduced order model (ROM) for the 2QGE that leverages randomized proper orthogonal decomposition (rPOD) and long short-term memory (LSTM) networks. To efficiently generate the snapshot data required for model construction, we apply a nonlinear filtering stabilization technique that allows for the use of larger mesh sizes compared to a direct numerical simulation (DNS). Thanks to the use of rPOD to extract the dominant modes from the snapshot matrices, we achieve up to 700 times speedup over the use of deterministic POD. LSTM networks are trained with the modal coefficients associated with the snapshots to enable the prediction of the time- and parameter-dependent modal coefficients during the online phase, which is hundreds of thousands of times faster than a DNS. We assess the accuracy and efficiency of our rPOD-LSTM ROM through an extension of a well-known benchmark called double-gyre wind forcing test. The dimension of the parameter space in this test is increased from two to four.
中文: 所提出的数据驱动降阶模型结合了随机化本征正交分解和长短期记忆网络,在保持精度的同时显著加速了双层准地转方程的数值模拟。
English: The proposed data-driven reduced order model combines randomized proper orthogonal decomposition and LSTM networks to significantly accelerate numerical simulations of two-layer quasi-geostrophic equations while maintaining accuracy.

Authors:Yaqian Chen, Lin Li, Hanxue Gu, Haoyu Dong, Derek L. Nguyen, Allan D. Kirk, Maciej A. Mazurowski, E. Shelley Hwang
Title: Breast density in MRI: an AI-based quantification and relationship to assessment in mammography
Abstract:
Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-house machine-learning algorithm to assess breast density on normal breasts in three MRI datasets. Breast density was consistent across different datasets (0.104 - 0.114). Analysis across different age groups also demonstrated strong consistency across datasets and confirmed a trend of decreasing density with age as reported in previous studies. MR breast density was correlated with mammographic breast density, although some notable differences suggest that certain breast density components are captured only on MRI. Future work will determine how to integrate MR breast density with current tools to improve future breast cancer risk prediction.
Chinese: 本研究应用机器学习算法评估乳腺磁共振成像密度,发现不同数据集结果一致且与乳腺X线密度相关,未来研究将整合磁共振数据以改进乳腺癌风险预测。
English: This study used a machine-learning algorithm to assess breast density on MRI, finding consistent results across datasets and a correlation with mammographic density, with future research aimed at integrating MRI data to enhance breast cancer risk prediction.

Authors:Yuheng Huang, Lei Ma, Keizaburo Nishikino, Takumi Akazaki
Title: Risk Assessment Framework for Code LLMs via Leveraging Internal States
Abstract:
The pre-training paradigm plays a key role in the success of Large Language Models (LLMs), which have been recognized as one of the most significant advancements of AI recently. Building on these breakthroughs, code LLMs with advanced coding capabilities bring huge impacts on software engineering, showing the tendency to become an essential part of developers' daily routines. However, the current code LLMs still face serious challenges related to trustworthiness, as they can generate incorrect, insecure, or unreliable code. Recent exploratory studies find that it can be promising to detect such risky outputs by analyzing LLMs' internal states, akin to how the human brain unconsciously recognizes its own mistakes. Yet, most of these approaches are limited to narrow sub-domains of LLM operations and fall short of achieving industry-level scalability and practicability. To address these challenges, in this paper, we propose PtTrust, a two-stage risk assessment framework for code LLM based on internal state pre-training, designed to integrate seamlessly with the existing infrastructure of software companies. The core idea is that the risk assessment framework could also undergo a pre-training process similar to LLMs. Specifically, PtTrust first performs unsupervised pre-training on large-scale unlabeled source code to learn general representations of LLM states. Then, it uses a small, labeled dataset to train a risk predictor. We demonstrate the effectiveness of PtTrust through fine-grained, code line-level risk assessment and demonstrate that it generalizes across tasks and different programming languages. Further experiments also reveal that PtTrust provides highly intuitive and interpretable features, fostering greater user trust. We believe PtTrust makes a promising step toward scalable and trustworthy assurance for code LLMs.
中文摘要:预训练对大型语言模型至关重要,但代码模型存在可信度问题,因此提出PtTrust框架,通过内部状态预训练实现可扩展的风险评估。
English Summary: Pre-training is crucial for large language models (LLMs), but code LLMs face trustworthiness issues, leading to the proposed PtTrust framework that uses internal state pre-training for scalable risk assessment.

Authors:Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Minyi Guo
Title: Optimizing SLO-oriented LLM Serving with PD-Multiplexing
Abstract:
Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases (prefill and decode) and complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partition enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partition also encounters low utilization and high overhead due to phase-coupling design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place and phase-decoupled compute partition. Drift leverages low-level GPU partitioning techniques to multiplex prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully leverage the multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.
中文摘要:Drift是一种新型LLM服务框架,通过PD多路复用技术解决了现代LLM服务中吞吐量与SLO之间的固有矛盾,在保证服务等级目标的同时实现了吞吐量的大幅提升。
English Summary: Drift is a novel LLM serving framework that resolves the throughput-SLO tradeoff in modern LLM services through PD multiplexing, achieving significant throughput gains while consistently meeting service level objectives.

Authors:Runzhen Xue, Hao Wu, Mingyu Yan, Ziheng Xiao, Xiaochun Ye, Dongrui Fan
Title: MetaDSE: A Few-shot Meta-learning Framework for Cross-workload CPU Design Space Exploration
Abstract:
Cross-workload design space exploration (DSE) is crucial in CPU architecture design. Existing DSE methods typically employ the transfer learning technique to leverage knowledge from source workloads, aiming to minimize the requirement of target workload simulation. However, these methods struggle with overfitting, data ambiguity, and workload dissimilarity. To address these challenges, we reframe the cross-workload CPU DSE task as a few-shot meta-learning problem and further introduce MetaDSE. By leveraging model agnostic meta-learning, MetaDSE swiftly adapts to new target workloads, greatly enhancing the efficiency of cross-workload CPU DSE. Additionally, MetaDSE introduces a novel knowledge transfer method called the workload-adaptive architectural mask algorithm, which uncovers the inherent properties of the architecture. Experiments on SPEC CPU 2017 demonstrate that MetaDSE significantly reduces prediction error by 44.3% compared to the state-of-the-art. MetaDSE is open-sourced and available at this anonymous GitHub: https://anonymous.4open.science/r/Meta_DSE-02F8
中文:MetaDSE采用元学习方法改进跨工作负载CPU设计空间探索,通过快速适应新工作负载和创新的架构掩码算法,将预测误差显著降低了44.3%。
English: MetaDSE introduces a meta-learning approach to cross-workload CPU design space exploration, reducing prediction errors by 44.3% through rapid adaptation to new workloads and a novel architectural mask algorithm.

Authors:Omar Alnaseri, Yassine Himeur, Shadi Atalla, Wathiq Mansoor
Title: Complexity of Post-Quantum Cryptography in Embedded Systems and Its Optimization Strategies
Abstract:
With the rapid advancements in quantum computing, traditional cryptographic schemes like Rivest-Shamir-Adleman (RSA) and elliptic curve cryptography (ECC) are becoming vulnerable, necessitating the development of quantum-resistant algorithms. The National Institute of Standards and Technology (NIST) has initiated a standardization process for PQC algorithms, and several candidates, including CRYSTALS-Kyber and McEliece, have reached the final stages. This paper first provides a comprehensive analysis of the hardware complexity of post-quantum cryptography (PQC) in embedded systems, categorizing PQC algorithms into families based on their underlying mathematical problems: lattice-based, code-based, hash-based and multivariate / isogeny-based schemes. Each family presents distinct computational, memory, and energy profiles, making them suitable for different use cases. To address these challenges, this paper discusses optimization strategies such as pipelining, parallelization, and high-level synthesis (HLS), which can improve the performance and energy efficiency of PQC implementations. Finally, a detailed complexity analysis of CRYSTALS-Kyber and McEliece, comparing their key generation, encryption, and decryption processes in terms of computational complexity, has been conducted.
中文: 本文分析了后量子密码算法在嵌入式系统中的硬件复杂性,按数学基础对其分类,并探讨了提升性能和能效的优化策略。
English: This paper analyzes the hardware complexity of post-quantum cryptography algorithms in embedded systems, categorizing them by mathematical foundations and discussing optimization strategies to enhance performance and energy efficiency.

Authors:Farhad Nawaz, Minjun Sung, Darshan Gadginmath, Jovin D'sa, Sangjae Bae, David Isele, Nadia Figueroa, Nikolai Matni, Faizan M. Tariq
Title: Graph-based Path Planning with Dynamic Obstacle Avoidance for Autonomous Parking
Abstract:
Safe and efficient path planning in parking scenarios presents a significant challenge due to the presence of cluttered environments filled with static and dynamic obstacles. To address this, we propose a novel and computationally efficient planning strategy that seamlessly integrates the predictions of dynamic obstacles into the planning process, ensuring the generation of collision-free paths. Our approach builds upon the conventional Hybrid A* algorithm by introducing a time-indexed variant that explicitly accounts for the predictions of dynamic obstacles during node exploration in the graph, thus enabling dynamic obstacle avoidance. We integrate the time-indexed Hybrid A* algorithm within an online planning framework to compute local paths at each planning step, guided by an adaptively chosen intermediate goal. The proposed method is validated in diverse parking scenarios, including perpendicular, angled, and parallel parking. Through simulations, we showcase our approach's potential to greatly improve efficiency and safety compared with the state-of-the-art spline-based planning method for parking situations.
Chinese: 本研究提出了一种时间索引的混合A*算法,将动态障碍物预测融入路径规划过程,在多种停车场景中相比现有方法显著提升了安全性和效率。
English: This study introduces a time-indexed Hybrid A* algorithm that integrates dynamic obstacle predictions into path planning, significantly enhancing safety and efficiency in various parking scenarios compared to existing methods.
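
The idea of time-indexing the search can be sketched on a plain grid. This is not the paper's Hybrid A* variant, which plans over continuous vehicle poses; it is a simplified 4-connected grid A* in which each search node is a (cell, time step) pair, and a cell is rejected if a predicted dynamic obstacle occupies it at the arrival time. The function name, the `predicted` dictionary format, the wait action, and the `max_t` horizon are all assumptions made for illustration, and swap (edge) collisions are not checked.

```python
import heapq

def time_indexed_astar(grid_size, start, goal, static_obstacles, predicted, max_t=50):
    """A* over (cell, time) states on a grid.

    predicted maps a time step to the set of cells that dynamic obstacles
    are predicted to occupy at that step. Every action, including waiting
    in place, costs one time step. Returns a list of cells indexed by time,
    or None if the goal is not reached within max_t steps.
    """
    width, height = grid_size

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    def blocked(cell, t):
        return cell in static_obstacles or cell in predicted.get(t, set())

    moves = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # (0, 0) means wait
    open_heap = [(heuristic(start), 0, start, [start])]
    visited = set()
    while open_heap:
        _, t, cell, path = heapq.heappop(open_heap)
        if cell == goal:
            return path
        if (cell, t) in visited or t >= max_t:
            continue
        visited.add((cell, t))
        for dx, dy in moves:
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if blocked(nxt, t + 1):
                continue
            heapq.heappush(open_heap, (t + 1 + heuristic(nxt), t + 1, nxt, path + [nxt]))
    return None

# A 5x1 corridor where an obstacle is predicted at cell (2, 0) at time step 2.
path = time_indexed_astar((5, 1), (0, 0), (4, 0), set(), {2: {(2, 0)}})
blocked_path = time_indexed_astar((5, 1), (0, 0), (4, 0), {(2, 0)}, {})
```

In the corridor example the planner has to spend one extra step waiting before it can pass through the contested cell, whereas a permanent obstacle in the same cell leaves no path at all.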

Authors:Aoran Liu, Kun Hu, Clinton Mo, Changyang Li, Zhiyong Wang
Title: Extended Short- and Long-Range Mesh Learning for Fast and Generalized Garment Simulation
Abstract:
3D garment simulation is a critical component for producing cloth-based graphics. Recent advancements in graph neural networks (GNNs) offer a promising approach for efficient garment simulation. However, GNNs require extensive message-passing to propagate information such as physical forces and maintain contact awareness across the entire garment mesh, which becomes computationally inefficient at higher resolutions. To address this, we devise a novel GNN-based mesh learning framework with two key components to extend the message-passing range with minimal overhead, namely the Laplacian-Smoothed Dual Message-Passing (LSDMP) and the Geodesic Self-Attention (GSA) modules. LSDMP enhances message-passing with a Laplacian features smoothing process, which efficiently propagates the impact of each vertex to nearby vertices. Concurrently, GSA introduces geodesic distance embeddings to represent the spatial relationship between vertices and utilises attention mechanisms to capture global mesh information. The two modules operate in parallel to ensure both short- and long-range mesh modelling. Extensive experiments demonstrate the state-of-the-art performance of our method, requiring fewer layers and lower inference latency.
中文: 本文提出了一种新颖的基于图神经网络的框架,通过拉普拉斯平滑双消息传递和测地线自注意力模块,高效扩展3D服装模拟中的消息传递范围,以更少的网络层和更低延迟实现了最先进的性能。
English: This paper introduces a novel GNN-based framework with Laplacian-Smoothed Dual Message-Passing and Geodesic Self-Attention modules to efficiently extend message-passing range in 3D garment simulation, achieving state-of-the-art performance with fewer layers and lower latency.

Authors:Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, Sinno Jialin Pan
Title: Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM's preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.
Chinese: 本研究提出了一种三编码器顺序检索器,将组合检索建模为马尔可夫决策过程,通过条件化文档选择并显式捕捉示例间依赖关系,实验证明其性能显著优于基线方法。
English: This study introduces a tri-encoder sequential retriever that models compositional retrieval as a Markov Decision Process, enabling conditional document selection and demonstrating superior performance over baseline methods by explicitly capturing dependencies between examples.
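
The decomposition mentioned in the abstract, where the probability of retrieving a set of elements becomes a sequence of conditional probabilities, corresponds to the chain rule below. The notation is assumed for illustration: $x$ is the query, $z_1, \dots, z_k$ are the retrieved elements in selection order, and $\pi_\theta$ is the retriever's policy.

```latex
P_\theta(z_1, \dots, z_k \mid x) \;=\; \prod_{t=1}^{k} \pi_\theta\left(z_t \mid x, z_1, \dots, z_{t-1}\right)
```

Each factor conditions on the elements already selected, so the choice at step $t$ can depend on what has been retrieved so far, unlike single-pass top-$k$ ranking, which scores each candidate against the query independently.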

Authors:Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
Title: A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Abstract:
Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields performance competitive with GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
中文: 强化学习微调方法如GRPO的有效性主要源于剔除完全错误回答的提示,而一种名为Reinforce-Rej的简化方法通过同时排除完全错误和完全正确的样本,提供了更高效稳定的替代方案。
English: Reinforcement learning fine-tuning methods like GRPO are effective primarily by filtering out prompts with completely wrong responses, and a simpler approach called Reinforce-Rej, which excludes both entirely incorrect and correct samples, offers a more efficient and stable alternative.
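
The two sample-selection rules named in the abstract can be sketched as simple filters over sampled responses. This is a minimal illustration, assuming binary rewards (1 for a correct response, 0 otherwise) and a hypothetical data layout in which each prompt maps to a list of response dictionaries; the function names are not from the paper, and the policy-gradient update itself is omitted.

```python
def raft_filter(groups):
    """RAFT-style selection: keep only positively rewarded responses."""
    kept = {}
    for prompt, responses in groups.items():
        positives = [r for r in responses if r["reward"] > 0]
        if positives:
            kept[prompt] = positives
    return kept

def reinforce_rej_filter(groups):
    """Reinforce-Rej-style selection: drop prompts whose sampled responses
    are all incorrect or all correct, and keep the mixed groups unchanged."""
    kept = {}
    for prompt, responses in groups.items():
        rewards = [r["reward"] for r in responses]
        all_correct = all(x > 0 for x in rewards)
        all_incorrect = all(x <= 0 for x in rewards)
        if not (all_correct or all_incorrect):
            kept[prompt] = responses
    return kept

# Toy batch: p1 is mixed, p2 is entirely incorrect, p3 is entirely correct.
groups = {
    "p1": [{"text": "a", "reward": 1}, {"text": "b", "reward": 0}],
    "p2": [{"text": "c", "reward": 0}, {"text": "d", "reward": 0}],
    "p3": [{"text": "e", "reward": 1}, {"text": "f", "reward": 1}],
}
raft_batch = raft_filter(groups)
rej_batch = reinforce_rej_filter(groups)
```

On the toy batch, the RAFT-style filter keeps the correct responses from `p1` and `p3`, while the Reinforce-Rej-style filter keeps only `p1`, with both its correct and incorrect response, since that is the only prompt with mixed outcomes.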

Authors:Minqian Liu, Zhiyang Xu, Xinyi Zhang, Heajun An, Sarvech Qadir, Qi Zhang, Pamela J. Wisniewski, Jin-Hee Cho, Sang Won Lee, Ruoxi Jia, Lifu Huang
Title: LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
Abstract:
Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.
中文: 大型语言模型存在显著的劝说安全风险,无法有效拒绝不道德任务并频繁使用有害策略,这一发现基于覆盖多种不道德主题和手法的综合评估框架。
English: Large language models exhibit concerning persuasion safety risks by failing to reject unethical tasks and employing harmful strategies, as revealed through a comprehensive assessment framework covering multiple unethical topics and tactics.

Authors:Eugene Yang, Nicola Tonellotto, Dawn Lawrie, Sean MacAvaney, James Mayfield, Douglas W. Oard, Scott Miller
Title: MURR: Model Updating with Regularized Replay for Searching a Document Stream
Abstract:
The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.
中文: 针对网络内容不断变化导致神经检索模型过时的问题,MURR方法通过正则化回放机制实现模型更新,无需重新处理历史文档即可保持动态环境中的检索效果一致性。
English: Neural retrieval models face obsolescence with evolving online content, but MURR enables effective updates without reprocessing past documents, maintaining retrieval consistency in dynamic environments.

Authors:Heming Xu, Xiaohui Liu, Zhilu Zhang, Hongzhi Zhang, Xiaohe Wu, Wangmeng Zuo
Title: Pseudo-Label Guided Real-World Image De-weathering: A Learning Framework with Imperfect Supervision
Abstract:
Real-world image de-weathering aims at removing various undesirable weather-related artifacts, e.g., rain, snow, and fog. To this end, acquiring ideal training pairs is crucial. Existing real-world datasets are typically constructed as paired data by extracting clean and degraded images from live streams of landscape scenes on the Internet. Despite the use of strict filtering mechanisms during collection, training pairs inevitably encounter inconsistency in terms of lighting, object position, scene details, etc., making de-weathering models possibly suffer from deformation artifacts under non-ideal supervision. In this work, we propose a unified solution for real-world image de-weathering with non-ideal supervision, i.e., a pseudo-label guided learning framework, to address various inconsistencies within the real-world paired dataset. Generally, it consists of a de-weathering model (De-W) and a Consistent Label Constructor (CLC), by which the restoration result can be adaptively supervised by the original ground-truth image to recover sharp textures while maintaining consistency with the degraded inputs in non-weather content through the supervision of pseudo-labels. Particularly, a Cross-frame Similarity Aggregation (CSA) module is deployed within CLC to enhance the quality of pseudo-labels by exploring the potential complementary information of multi-frames through a graph model. Moreover, we introduce an Information Allocation Strategy (IAS) to integrate the original ground-truth images and pseudo-labels, thereby facilitating the joint supervision for the training of the de-weathering model. Extensive experiments demonstrate that our method exhibits significant advantages when trained on imperfectly aligned de-weathering datasets in comparison with other approaches.
中文: 本文提出了一种伪标签引导的学习框架,通过跨帧相似性聚合和信息分配策略,结合原始真实图像监督与自适应伪标签,有效解决了真实世界图像去天气化训练中数据不一致的问题。
English: This paper introduces a pseudo-label guided learning framework for real-world image de-weathering that addresses inconsistencies in training pairs by combining ground-truth supervision with adaptive pseudo-labels through cross-frame similarity aggregation and information allocation strategies.

Authors:Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
Title: MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
Abstract:
We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench
中文摘要:MLRC-Bench是一个动态基准测试,旨在通过机器学习研究竞赛严格评估语言智能体提出和实现创新方法的能力,结果显示当前AI系统与人类研究者之间存在显著性能差距。
English Summary: MLRC-Bench is a dynamic benchmark designed to rigorously evaluate language agents' abilities to propose and implement novel methodologies in machine learning research competitions, revealing significant performance gaps between current AI systems and human researchers.
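The "9.3% of the gap" figure above can be read as a relative-gap metric. The following is a minimal sketch of one common way to compute such a number, assuming higher scores are better; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_gap_closed(agent_score: float, baseline_score: float, human_score: float) -> float:
    """Percentage of the baseline-to-human gap that the agent closes.

    Assumes higher scores are better (an assumption, not stated in the abstract).
    """
    gap = human_score - baseline_score
    if gap == 0:
        raise ValueError("baseline and human scores are identical; the gap is undefined")
    return 100.0 * (agent_score - baseline_score) / gap


# Made-up scores for illustration only: the agent improves on the baseline
# by 2.79 points out of a 30-point baseline-to-human gap.
example = relative_gap_closed(agent_score=52.79, baseline_score=50.0, human_score=80.0)
print(f"{example:.1f}% of the gap closed")
```

Under this reading, 0% means the agent only matches the baseline and 100% means it matches the top human participant.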

Authors:Boyang Deng, Songyou Peng, Kyle Genova, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser
Title: Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Abstract:
We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining," "overpass was painted blue," etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.
中文: 该系统利用多模态大语言模型分析数千万张城市时序图像,通过可扩展的自底向上方法实现开放式共现变化发现,性能显著优于现有基准。
English: This system employs Multimodal LLMs to analyze tens of millions of temporal urban images, enabling open-ended discovery of co-occurring changes through a bottom-up decomposition into tractable sub-problems that significantly outperforms baselines.

Authors:Neil Reichlin, Nicolas Baumann, Edoardo Ghignone, Michele Magno
Title: TinyCenterSpeed: Efficient Center-Based Object Detection for Autonomous Racing
Abstract:
Perception within autonomous driving is nearly synonymous with Neural Networks (NNs). Yet, the domain of autonomous racing is often characterized by scaled, computationally limited robots used for cost-effectiveness and safety. For this reason, opponent detection and tracking systems typically resort to traditional computer vision techniques due to computational constraints. This paper introduces TinyCenterSpeed, a streamlined adaptation of the seminal CenterPoint method, optimized for real-time performance on 1:10 scale autonomous racing platforms. This adaptation is viable even on OBCs powered solely by Central Processing Units (CPUs), as it incorporates the use of an external Tensor Processing Unit (TPU). We demonstrate that, compared to Adaptive Breakpoint Detector (ABD), the current State-of-the-Art (SotA) in scaled autonomous racing, TinyCenterSpeed not only improves detection and velocity estimation by up to 61.38% but also supports multi-opponent detection and estimation. It achieves real-time performance with an inference time of just 7.88 ms on the TPU, significantly reducing CPU utilization 8.3-fold.
中文: 本文提出TinyCenterSpeed,这是针对CPU驱动的自动驾驶赛车平台优化的轻量级CenterPoint版本,相比现有方法将检测与速度估计性能最高提升61.38%,同时实现多目标实时追踪且推理时间仅需7.88毫秒。
English: This paper presents TinyCenterSpeed, a lightweight version of CenterPoint optimized for real-time opponent detection on CPU-powered autonomous racing platforms, improving detection and velocity estimation by up to 61.38% over the existing state of the art while enabling multi-opponent detection with a 7.88 ms inference time on an external TPU.

Authors:Fucheng Jia, Zewen Wu, Shiqi Jiang, Huiqiang Jiang, Qianxi Zhang, Yuqing Yang, Yunxin Liu, Ju Ren, Deyu Zhang, Ting Cao
Title: Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash
Abstract:
Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.
中文摘要:ActiveFlow是一种创新的LLM推理框架,通过主动权重交换和三项新技术实现动态DRAM使用,使得更大模型能在移动设备上运行,并在性能与成本效率方面达到最优。
English Summary: ActiveFlow is a novel LLM inference framework that enables adaptive DRAM usage through active weight swapping and three innovative techniques, allowing larger models to run on mobile devices while achieving superior performance-cost efficiency.

Authors:Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang
Title: STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge
Abstract:
Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team. By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long-video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo-labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.
Chinese: "imaplus"团队提出的STSeg方案通过在MOSE数据集上微调SAM2和TMO模型,并采用自适应伪标签引导的模型优化流程,在复杂场景视频对象分割中实现了最优性能,荣获2025年PVUW挑战赛冠军。
English: The STSeg solution, developed by the "imaplus" team, finetunes SAM2 and TMO models on the MOSE dataset and employs an adaptive refinement pipeline to achieve state-of-the-art performance in complex video object segmentation, winning first place in the 2025 PVUW Challenge.

Authors:Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu
Title: AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
Abstract:
AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs, but in this work we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that most of the competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM's practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and development of AI writing systems better aligned with human preferences.
中文摘要:本研究针对AI生成文本的质量评估难题,通过建立写作质量基准和训练专用奖励模型,显著提升了符合人类偏好的优质文本筛选能力。
English Summary: This research addresses the challenge of evaluating AI-generated text quality by introducing a Writing Quality Benchmark and specialized reward models (WQRM) that reach 74% accuracy on the benchmark and, when used to rank candidate revisions at test time, select writing that experts prefer 66% of the time.
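The generate-and-rank step described in the abstract can be sketched as a best-of-n selection. This is a minimal illustration only: `toy_reward` is a hypothetical stand-in that merely penalizes a few cliché words and is not WQRM, and including the original draft among the candidates is a design assumption made here.

```python
from typing import Callable, List, Tuple


def select_best_revision(
    draft: str,
    candidates: List[str],
    reward_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Score the draft and every candidate revision, then return the top-scoring text.

    The draft is kept in the pool so that selection never does worse than
    the starting point under the given reward function.
    """
    scored = [(text, float(reward_fn(text))) for text in [draft, *candidates]]
    return max(scored, key=lambda pair: pair[1])


def toy_reward(text: str) -> float:
    """Hypothetical reward: minus the number of cliché occurrences (NOT the paper's model)."""
    cliches = ("tapestry", "delve", "testament")
    lowered = text.lower()
    return -float(sum(lowered.count(word) for word in cliches))


draft = "A rich tapestry of ideas, a testament to craft."
revisions = ["Ideas woven into a tapestry.", "Ideas, carefully arranged."]
best_text, best_score = select_best_revision(draft, revisions, toy_reward)
print(best_text, best_score)
```

In practice `reward_fn` would wrap a trained reward model, and the candidate revisions would come from sampling an LLM several times on the same draft.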

Authors:Krzysztof Byrski, Jacek Tabor, Przemysław Spurek, Marcin Mazur
Title: CEC-MMR: Cross-Entropy Clustering Approach to Multi-Modal Regression
Abstract:
In practical applications of regression analysis, it is not uncommon to encounter a multitude of values for each attribute. In such a situation, the univariate distribution, which is typically Gaussian, is suboptimal because the mean may be situated between modes, resulting in a predicted value that differs significantly from the actual data. Consequently, to address this issue, a mixture distribution with parameters learned by a neural network, known as a Mixture Density Network (MDN), is typically employed. However, this approach has an important inherent limitation, in that it is not feasible to ascertain the precise number of components with a reasonable degree of accuracy. In this paper, we introduce CEC-MMR, a novel approach based on Cross-Entropy Clustering (CEC), which allows for the automatic detection of the number of components in a regression problem. Furthermore, given an attribute and its value, our method is capable of uniquely identifying it with the underlying component. The experimental results demonstrate that CEC-MMR yields superior outcomes compared to classical MDNs.
中文: 本文提出基于交叉熵聚类的CEC-MMR方法,能够自动确定回归中的组件数量,克服了混合密度网络的限制,通过唯一识别组件获得更优结果。
English: The paper introduces CEC-MMR, a method using Cross-Entropy Clustering to automatically determine the number of components in regression, overcoming the limitations of Mixture Density Networks by uniquely identifying components and achieving better results.
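The abstract's point that a single mean "may be situated between modes" can be checked numerically. The sketch below samples from a hypothetical two-component Gaussian mixture (modes at -2 and +2, chosen only for illustration) and measures how little data lies near the overall mean; it does not implement CEC-MMR or an MDN.

```python
import random
import statistics
from typing import Tuple


def bimodal_mean_demo(n_samples: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Return (sample mean, fraction of samples within 0.5 of that mean).

    Targets come from an equal-weight mixture of N(-2, 0.3^2) and N(+2, 0.3^2).
    """
    rng = random.Random(seed)
    samples = [
        rng.gauss(-2.0, 0.3) if rng.random() < 0.5 else rng.gauss(2.0, 0.3)
        for _ in range(n_samples)
    ]
    mean = statistics.fmean(samples)
    near_mean = sum(1 for s in samples if abs(s - mean) < 0.5) / n_samples
    return mean, near_mean


mean, near_mean = bimodal_mean_demo()
print(f"mean = {mean:.3f}, fraction of data within 0.5 of the mean = {near_mean:.4f}")
```

The mean lands near zero, where almost no observations fall, which is why a unimodal predictor is a poor fit for such targets and a mixture model is preferred.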

Authors:Israfel Salazar, Manuel Fernández Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik Krzemiński, Jekaterina Novikova, Luísa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary, Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popovič, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia Soltani Moakhar, Gabriel da Costa Merlin, Otávio Ferracioli Coletti, Maral Jabbari Shiviari, MohammadAmin farahani fard, Silvia Fernandez, María Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Setayesh Heydari, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee
Title: Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation
Abstract:
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.
中文:Kaleidoscope被提出作为迄今最全面的多语言视觉语言模型评测基准,涵盖18种语言和14个科目,通过文化真实性问题揭示了现有模型在低资源语言和复杂多模态场景中的显著性能不足。
English: Kaleidoscope is introduced as the most comprehensive multilingual vision-language model benchmark to date, covering 18 languages and 14 subjects with culturally authentic questions, revealing significant performance gaps in low-resource languages and complex multimodal scenarios.

Authors:Yuze Jiang, Ehsan Javanmardi, Manabu Tsukada, Hiroshi Esaki
Title: Towards Efficient Roadside LiDAR Deployment: A Fast Surrogate Metric Based on Entropy-Guided Visibility
Abstract:
The deployment of roadside LiDAR sensors plays a crucial role in the development of Cooperative Intelligent Transport Systems (C-ITS). However, the high cost of LiDAR sensors necessitates efficient placement strategies to maximize detection performance. Traditional roadside LiDAR deployment methods rely on expert insight, making them time-consuming. Automating this process, however, demands extensive computation, as it requires not only visibility evaluation but also assessing detection performance across different LiDAR placements. To address this challenge, we propose a fast surrogate metric, the Entropy-Guided Visibility Score (EGVS), based on information gain to evaluate object detection performance in roadside LiDAR configurations. EGVS leverages Traffic Probabilistic Occupancy Grids (TPOG) to prioritize critical areas and employs entropy-based calculations to quantify the information captured by LiDAR beams. This eliminates the need for direct detection performance evaluation, which typically requires extensive labeling and computational resources. By integrating EGVS into the optimization process, we significantly accelerate the search for optimal LiDAR configurations. Experimental results using the AWSIM simulator demonstrate that EGVS strongly correlates with Average Precision (AP) scores and effectively predicts object detection performance. This approach offers a computationally efficient solution for roadside LiDAR deployment, facilitating scalable smart infrastructure development.
中文摘要:提出的熵引导可见性评分(EGVS)通过信息增益计算评估检测性能,有效优化路边激光雷达布局,无需计算密集的直接评估即可保持与目标检测精度的强相关性。
English Summary: The proposed Entropy-Guided Visibility Score (EGVS) efficiently optimizes roadside LiDAR placement by evaluating detection performance through information gain calculations, eliminating the need for computationally intensive direct evaluations while maintaining strong correlation with object detection accuracy.

Authors:Shijie Liu, Ruixing Ding, Weihai Lu, Jun Wang, Mo Yu, Xiaoming Shi, Wei Zhang
Title: Coherency Improved Explainable Recommendation via Large Language Model
Abstract:
Explainable recommender systems are designed to explain the reasoning behind each recommendation, enabling users to comprehend the underlying logic. Previous works perform rating prediction and explanation generation in a multi-task manner. However, these works suffer from incoherence between predicted ratings and explanations. To address the issue, we propose a novel framework that employs a large language model (LLM) to generate a rating, transforms it into a rating vector, and finally generates an explanation based on the rating vector and user-item information. Moreover, we propose utilizing publicly available LLMs and pre-trained sentiment analysis models to automatically evaluate the coherence without human annotations. Extensive experimental results on three datasets of explainable recommendation show that the proposed framework is effective, outperforming state-of-the-art baselines with improvements of 7.3% in explainability and 4.4% in text quality.
中文: 本文提出了一种新颖框架,利用大型语言模型生成推荐系统的连贯评分与解释,在可解释性和文本质量方面显著超越了现有最优方法。
English: This paper introduces a novel framework that leverages large language models to generate coherent ratings and explanations for recommender systems, achieving significant improvements in explainability and text quality over existing methods.

Authors:Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, Ranjay Krishna
Title: Seeking and Updating with Live Visual Knowledge
Abstract:
The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates in their fixed training datasets. To quantify this stagnation, we introduce LiveVQA, the first-of-its-kind dataset featuring 107,143 samples across 12 categories, specifically designed to support research in both seeking and updating with live visual knowledge. Drawing from recent news articles, video platforms, and academic publications from April 2024 to May 2025, LiveVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond the knowledge cutoff, while tool-use or agentic visual seeking frameworks yield an average improvement of 327%. Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to update MLLMs with new visual knowledge. In particular, we examine the critical balance between adapter capacity and model capability during such updates. All experimental datasets and source code are publicly available at: https://livevqa.github.io.
中文: LiveVQA数据集旨在解决多模态大语言模型在处理实时视觉信息时的滞后问题,揭示了模型在最新内容上的显著性能差距,并证明工具使用框架和微调方法能大幅提升其更新知识的能力。
English: The LiveVQA dataset is introduced to address the stagnation of Multimodal Large Language Models (MLLMs) in handling live visual information, revealing significant performance gaps and demonstrating that tool-use frameworks and fine-tuning methods can substantially improve their ability to process up-to-date content.

Authors:Cheng Chen, Jiacheng Wei, Tianrun Chen, Chi Zhang, Xiaofeng Yang, Shangzhan Zhang, Bingchen Yang, Chuan-Sheng Foo, Guosheng Lin, Qixing Huang, Fayao Liu
Title: CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images
Abstract:
Creating CAD digital twins from the physical world is crucial for manufacturing, design, and simulation. However, current methods typically rely on costly 3D scanning with labor-intensive post-processing. To provide a user-friendly design process, we explore the problem of reverse engineering from unconstrained real-world CAD images that can be easily captured by users of all experiences. However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image-to-parametric CAD model generation framework that trains solely on synthetic textureless CAD data while testing on real-world images. To bridge the significant representation disparity between images and parametric CAD models, we introduce a geometry encoder to accurately capture diverse geometric features. Moreover, the texture-invariant properties of the geometric features can also facilitate the generalization to real-world scenarios. Since compiling CAD parameter sequences into explicit CAD models is a non-differentiable process, the network training inherently lacks explicit geometric supervision. To impose geometric validity constraints, we employ direct preference optimization (DPO) to fine-tune our model with the automatic code checker feedback on CAD sequence quality. Furthermore, we collected a real-world dataset, comprised of multi-view images and corresponding CAD command sequence pairs, to evaluate our method. Experimental results demonstrate that our approach can robustly handle real unconstrained CAD images, and even generalize to unseen general objects.
中文摘要:本研究提出了CADCrafter框架,通过几何编码器和直接偏好优化技术,仅使用合成无纹理数据训练即可从真实图像生成参数化CAD模型,成功解决了数据稀缺和几何监督难题,并展现出对未知物体的泛化能力。
English Summary: The study introduces CADCrafter, a framework that generates parametric CAD models from unconstrained real-world images while training solely on synthetic textureless CAD data; a geometry encoder and direct preference optimization with automatic code-checker feedback address data scarcity and the lack of explicit geometric supervision, and the approach generalizes to unseen general objects.

Authors:Mingyang Wang, Heike Adel, Lukas Lange, Yihong Liu, Ercong Nie, Jannik Strötgen, Hinrich Schütze
Title: Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
Abstract:
Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.
中文: 多语言模型在多数层中将知识编码于语言无关的概念空间,但在最后几层向语言特定空间转换时易出错,导致跨语言事实回答不一致;通过线性捷径方法绕过末层计算可有效提升准确性与一致性。
English: Multilingual language models encode knowledge in a language-independent concept space through most layers but often fail during the transition to language-specific spaces in the final layers, causing inconsistent factual responses across languages; a linear shortcut method that bypasses final-layer computations improves both accuracy and cross-lingual consistency.

Authors:Senkang Hu, Yanan Ma, Yihang Tao, Zhengru Fang, Zihan Fang, Yiqin Deng, Sam Kwong, Yuguang Fang
Title: Task-Aware Parameter-Efficient Fine-Tuning of Large Pre-Trained Models at the Edge
Abstract:
Large language models (LLMs) have achieved remarkable success in various tasks, such as decision-making, reasoning, and question answering. They have been widely used in edge devices. However, fine-tuning LLMs to specific tasks at the edge is challenging due to the high computational cost and the limited storage and energy resources at the edge. To address this issue, we propose TaskEdge, a task-aware parameter-efficient fine-tuning framework at the edge, which allocates the most effective parameters to the target task and only updates the task-specific parameters. Specifically, we first design a parameter importance calculation criterion that incorporates both weights and input activations into the computation of weight importance. Then, we propose a model-agnostic task-specific parameter allocation algorithm to ensure that task-specific parameters are distributed evenly across the model, rather than being concentrated in specific regions. In doing so, TaskEdge can significantly reduce the computational cost and memory usage while maintaining performance on the target downstream tasks by updating less than 0.1% of the parameters. In addition, TaskEdge can be easily integrated with structured sparsity to enable acceleration by NVIDIA's specialized sparse tensor cores, and it can be seamlessly integrated with LoRA to enable efficient sparse low-rank adaptation. Extensive experiments on various tasks demonstrate the effectiveness of TaskEdge.
中文: TaskEdge是一种参数高效微调框架,通过仅更新不到0.1%的任务特定参数,在保证性能的同时解决了大型语言模型在边缘设备上适应任务时面临的计算资源限制问题。
English: TaskEdge is a parameter-efficient fine-tuning framework that addresses the computational challenges of adapting large language models for edge devices by selectively updating less than 0.1% of task-specific parameters while maintaining performance.
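The abstract does not give TaskEdge's exact importance criterion or allocation rule. As a rough sketch of the general idea, the code below assumes a score of the form |weight| times the L2 norm of the matching input feature (a Wanda-style choice swapped in here for illustration) and selects the top-k entries per output row so that trainable parameters are spread evenly rather than concentrated. All names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def importance_scores(weights: Matrix, activations: Matrix) -> Matrix:
    """Score each weight as |w_ij| * ||x_j||_2 (assumed criterion, not the paper's).

    weights: [n_out][n_in]; activations: [n_samples][n_in] calibration inputs.
    """
    n_in = len(weights[0])
    col_norms = [
        math.sqrt(sum(sample[j] ** 2 for sample in activations)) for j in range(n_in)
    ]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]


def per_row_topk_mask(scores: Matrix, k: int) -> List[List[bool]]:
    """Mark the k highest-scoring weights in every row as trainable."""
    mask = []
    for row in scores:
        top = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        mask.append([j in top for j in range(len(row))])
    return mask


weights = [[0.1, -2.0, 0.5], [1.0, 0.2, -0.3]]
activations = [[3.0, 0.1, 1.0], [4.0, 0.1, 1.0]]
mask = per_row_topk_mask(importance_scores(weights, activations), k=1)
print(mask)
```

In this toy case the largest-magnitude weight (-2.0) is not selected because its input feature is almost always near zero, which is the kind of effect an activation-aware score is meant to capture. During fine-tuning, such a mask would be applied to the gradients so that only the selected entries are updated.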

Authors:Weibin Liao, Xin Gao, Tianyu Jia, Rihong Qiu, Yifan Zhu, Yang Lin, Xu Chu, Junfeng Zhao, Yasha Wang
Title: LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
Abstract:
Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
中文:LearNAT框架通过结合AST引导的任务分解、边界感知强化学习和自适应示例推理,显著提升了开源大语言模型在复杂NL2SQL任务中的表现,使70亿参数模型在基准测试中达到与GPT-4相当的水平。
English: The LearNAT framework enhances open-source LLMs' performance on complex NL2SQL tasks by integrating AST-guided task decomposition, margin-aware reinforcement learning, and adaptive demonstration reasoning, enabling a 7B model to match GPT-4's effectiveness on benchmarks.

Authors:Gabriele Greco, Carlo Cena, Umberto Albertin, Mauro Martini, Marcello Chiaberge
Title: Fault injection analysis of Real NVP normalising flow model for satellite anomaly detection
Abstract:
Satellites are used for a multitude of applications, including communications, Earth observation, and space science. Neural networks and deep learning-based approaches now represent the state-of-the-art to enhance the performance and efficiency of these tasks. Given that satellites are susceptible to various faults, one critical application of Artificial Intelligence (AI) is fault detection. However, despite the advantages of neural networks, these systems are vulnerable to radiation errors, which can significantly impact their reliability. Ensuring the dependability of these solutions requires extensive testing and validation, particularly using fault injection methods. This study analyses a physics-informed (PI) real-valued non-volume preserving (Real NVP) normalizing flow model for fault detection in space systems, with a focus on resilience to Single-Event Upsets (SEUs). We present a customized fault injection framework in TensorFlow to assess neural network resilience. Fault injections are applied through two primary methods: Layer State injection, targeting internal network components such as weights and biases, and Layer Output injection, which modifies layer outputs across various activations. Fault types include zeros, random values, and bit-flip operations, applied at varying levels and across different network layers. Our findings reveal several critical insights, such as the significance of bit-flip errors in critical bits, that can lead to substantial performance degradation or even system failure. With this work, we aim to exhaustively study the resilience of Real NVP models against errors due to radiation, providing a means to guide the implementation of fault tolerance measures.
中文: 本研究通过自定义TensorFlow框架注入多种故障类型,评估了基于物理信息的Real NVP模型在卫星故障检测中的表现,重点分析了神经网络对辐射引发错误的抗干扰能力。
English: This study evaluates a physics-informed Real NVP model for detecting faults in satellites, using a custom TensorFlow framework to inject various fault types and assess neural network resilience against radiation-induced errors.

Authors:Yuqi Ye, Li You, Hao Xu, Ahmed Elzanaty, Kai-Kit Wong, Xiqi Gao
Title: SCNR Maximization for MIMO ISAC Assisted by Fluid Antenna System
Abstract:
The integrated sensing and communication (ISAC) technology has been extensively researched to enhance communication rates and radar sensing capabilities. Additionally, a new technology known as fluid antenna system (FAS) has recently been proposed to obtain higher communication rates for future wireless networks by dynamically altering the antenna position to obtain a more favorable channel condition. The application of the FAS technology in ISAC scenarios holds significant research potential. In this paper, we investigate a FAS-assisted multiple-input multiple-output (MIMO) ISAC system for maximizing the radar sensing signal-clutter-noise ratio (SCNR) under communication signal-to-interference-plus-noise ratio (SINR) and antenna position constraints. We devise an iterative algorithm that tackles the optimization problem by maximizing a lower bound of SCNR with respect to the transmit precoding matrix and the antenna position. By addressing the non-convexity of the problem through this iterative approach, our method significantly improves the SCNR. Our simulation results demonstrate that the proposed scheme achieves a higher SCNR compared to the baselines.
Chinese: 本文研究了一种流体天线系统(FAS)辅助的MIMO集成感知与通信(ISAC)系统,通过迭代算法在通信SINR和天线位置约束下优化发射预编码和天线位置,以最大化雷达感知信杂噪比(SCNR)。
English: This paper investigates a fluid antenna system (FAS)-assisted MIMO integrated sensing and communication (ISAC) system, using an iterative algorithm that optimizes the transmit precoding matrix and antenna positions to maximize the radar sensing signal-clutter-noise ratio (SCNR) under communication SINR and antenna position constraints.

Authors:Arash Hajisharifi, Michele Girfoglio, Annalisa Quaini, Gianluigi Rozza
Title: Combining Extended Convolutional Autoencoders and Reservoir Computing for Accurate Reduced-Order Predictions of Atmospheric Flows
Abstract:
Forecasting atmospheric flows with traditional discretization methods, also called full order methods (e.g., finite element methods or finite volume methods), is computationally expensive. We propose to reduce the computational cost with a Reduced Order Model (ROM) that combines Extended Convolutional Autoencoders (E-CAE) and Reservoir Computing (RC). Thanks to an extended network depth, the E-CAE encodes the high-resolution data coming from the full order method into a compact latent representation and can decode it back into high-resolution with 75% lower reconstruction error than standard CAEs. The compressed data are fed to an RC network, which predicts their evolution. The advantage of RC networks is a reduced computational cost in the training phase compared to conventional predictive models. We assess our data-driven ROM through well-known 2D and 3D benchmarks for atmospheric flows. We show that our ROM accurately reconstructs and predicts the future system dynamics with errors below 6% in 2D and 8% in 3D, while significantly reducing the computational cost of a full-order simulation. Compared to other ROMs available in the literature, such as Dynamic Mode Decomposition and Proper Orthogonal Decomposition with Interpolation, our ROM is as efficient but more accurate. Thus, it is a promising alternative to high-dimensional atmospheric simulations.
中文: 本研究提出了一种结合扩展卷积自编码器和储层计算的高效降阶模型,该模型在大气流动预测中误差低于8%,相比传统方法显著降低了计算成本,同时保持了高精度。
English: This study introduces a computationally efficient Reduced Order Model combining Extended Convolutional Autoencoders and Reservoir Computing, which achieves high accuracy with under 8% error in atmospheric flow predictions while significantly reducing computational costs compared to traditional methods.

Authors:Xiongfei Wu, Mingfei Cheng, Qiang Hu, Jianlang Chen, Yuheng Huang, Manabu Okada, Michio Hayashi, Tomoyuki Tsuchiya, Xiaofei Xie, Lei Ma
Title: Foundation Models for Autonomous Driving System: An Initial Roadmap
Abstract:
Recent advancements in Foundation Models (FMs), such as Large Language Models (LLMs), have significantly enhanced Autonomous Driving Systems (ADSs) by improving perception, reasoning, and decision-making in dynamic and uncertain environments. However, ADSs are highly complex cyber-physical systems that demand rigorous software engineering practices to ensure reliability and safety. Integrating FMs into ADSs introduces new challenges in system design and evaluation, requiring a systematic review to establish a clear research roadmap. To address these challenges, we present a structured roadmap for integrating FMs into autonomous driving, covering three key aspects: the infrastructure of FMs, their application in autonomous driving systems, and their current applications in practice. For each aspect, we review the current research progress, identify existing challenges, and highlight research gaps that need to be addressed by the community.
中文: 基础模型显著提升了自动驾驶系统的能力,但也带来了系统可靠性和安全性方面的新挑战,需要从基础设施、应用和实践三个层面制定结构化整合路线图。
English: Foundation Models significantly enhance autonomous driving capabilities but introduce new challenges in system reliability and safety, necessitating a structured roadmap for integration across infrastructure, application, and practical implementation.

Authors:Team Cohere, Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, Walter Beller-Morales, Alexandre Bérard, Andrew Berneshawi, Anna Bialas, Phil Blunsom, Matt Bobkin, Adi Bongale, Sam Braun, Maxime Brunet, Samuel Cahyawijaya, David Cairuz, Jon Ander Campos, Cassie Cao, Kris Cao, Roman Castagné, Julián Cendrero, Leila Chan Currie, Yash Chandak, Diane Chang, Giannis Chatziveroglou, Hongyu Chen, Claire Cheng, Alexis Chevalier, Justin T. Chiu, Eugene Cho, Eugene Choi, Eujeong Choi, Tim Chung, Volkan Cirik, Ana Cismaru, Pierre Clavier, Henry Conklin, Lucas Crawhall-Stein, Devon Crouse, Andres Felipe Cruz-Salinas, Ben Cyrus, Daniel D'souza, Hugo Dalla-Torre, John Dang, William Darling, Omar Darwiche Domingues, Saurabh Dash, Antoine Debugne, Théo Dehaze, Shaan Desai, Joan Devassy, Rishit Dholakia, Kyle Duffy, Ali Edalati, Ace Eldeib, Abdullah Elkady, Sarah Elsharkawy, Irem Ergün, Beyza Ermis, Marzieh Fadaee, Boyu Fan, Lucas Fayoux, Yannis Flet-Berliac, Nick Frosst, Matthias Gallé, Wojciech Galuba, Utsav Garg, Matthieu Geist, Mohammad Gheshlaghi Azar, Ellen Gilsenan-McMahon, Seraphina Goldfarb-Tarrant, Tomas Goldsack, Aidan Gomez, Victor Machado Gonzaga, Nithya Govindarajan, Manoj Govindassamy, Nathan Grinsztajn, Nikolas Gritsch, Patrick Gu, Shangmin Guo, Kilian Haefeli, Rod Hajjar, Tim Hawes, Jingyi He, Sebastian Hofstätter, Sungjin Hong, Sara Hooker, Tom Hosking, Stephanie Howe, Eric Hu, Renjie Huang, Hemant Jain, Ritika Jain, Nick Jakobi, Madeline Jenkins, JJ Jordan, Dhruti Joshi, Jason Jung, Trushant Kalyanpur, Siddhartha Rao Kamalakara, Julia Kedrzycki, Gokce Keskin, Edward Kim, Joon Kim, Wei-Yin Ko, Tom Kocmi, Michael Kozakov, Wojciech Kryściński, Arnav Kumar Jain, Komal Kumar Teru, Sander Land, Michael Lasby, Olivia Lasche, Justin Lee, Patrick Lewis, Jeffrey Li, Jonathan Li, Hangyu Lin, Acyr Locatelli, Kevin Luong, Raymond Ma, Lukáš Mach, Marina Machado, Joanne Magbitang, Brenda Malacara Lopez, Aryan Mann, Kelly Marchisio, Olivia Markham, Alexandre Matton, Alex McKinney, Dominic McLoughlin, Jozef Mokry, Adrien Morisot, Autumn Moulder, Harry Moynehan, Maximilian Mozes, Vivek Muppalla, Lidiya Murakhovska, Hemangani Nagarajan, Alekhya Nandula, Hisham Nasir, Shauna Nehra, Josh Netto-Rosen, Daniel Ohashi, James Owers-Bardsley, Jason Ozuzu, Dennis Padilla, Gloria Park, Sam Passaglia, Jeremy Pekmez, Laura Penstone, Aleksandra Piktus, Case Ploeg, Andrew Poulton, Youran Qi, Shubha Raghvendra, Miguel Ramos, Ekagra Ranjan, Pierre Richemond, Cécile Robert-Michon, Aurélien Rodriguez, Sudip Roy, Sebastian Ruder, Laura Ruis, Louise Rust, Anubhav Sachan, Alejandro Salamanca, Kailash Karthik Saravanakumar, Isha Satyakam, Alice Schoenauer Sebag, Priyanka Sen, Sholeh Sepehri, Preethi Seshadri, Ye Shen, Tom Sherborne, Sylvie Shang Shi, Sanal Shivaprasad, Vladyslav Shmyhlo, Anirudh Shrinivason, Inna Shteinbuk, Amir Shukayev, Mathieu Simard, Ella Snyder, Ava Spataru, Victoria Spooner, Trisha Starostina, Florian Strub, Yixuan Su, Jimin Sun, Dwarak Talupuru, Eugene Tarassov, Elena Tommasone, Jennifer Tracey, Billy Trend, Evren Tumer, Ahmet Üstün, Bharat Venkitesh, David Venuto, Pat Verga, Maxime Voisin, Alex Wang, Donglu Wang, Shijian Wang, Edmond Wen, Naomi White, Jesse Willman, Marysia Winkels, Chen Xia, Jessica Xie, Minjie Xu, Bowen Yang, Tan Yi-Chern, Ivan Zhang, Zhenyu Zhao, Zhoujie Zhao
Title: Command A: An Enterprise-Ready Large Language Model
Abstract:
In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.
Chinese: 本报告介绍了Command A,这是一款专为企业应用优化的高性能多语言大模型,具备先进的检索增强生成能力和混合架构,通过去中心化训练方法开发,并在多项商业任务中展现出卓越性能。
English: This report introduces Command A, a high-performance multilingual large language model optimized for enterprise applications with advanced RAG capabilities and a hybrid architecture, trained through decentralized methods and evaluated across various business tasks.

Authors:Tom Westermann, Malte Ramonat, Johannes Hujer, Felix Gehlhoff, Alexander Fay
Title: Automatic Mapping of AutomationML Files to Ontologies for Graph Queries and Validation
Abstract:
AutomationML has seen widespread adoption as an open data exchange format in the automation domain. It is an open and vendor neutral standard based on the extensible markup language XML. However, AutomationML extends XML with additional semantics that limit the applicability of common XML-tools for applications like querying or data validation. This article demonstrates how the transformation of AutomationML into OWL enables new use cases in querying with SPARQL and validation with SHACL. To support this, it provides practitioners with (1) an up-to-date ontology of the concepts defined in the AutomationML standard and (2) a declarative mapping to automatically transform any AutomationML model into RDF triples. A study on examples from the automation domain concludes that transforming AutomationML to OWL opens up new powerful ways for querying and validation that would have been impossible without this transformation.
中文:本文论证了将AutomationML转换为OWL能够利用SPARQL实现高效查询和SHACL进行数据验证,通过更新的本体和自动映射到RDF三元组,突破了原有XML工具的局限性。
English: This article demonstrates that transforming AutomationML into OWL enables powerful querying with SPARQL and validation with SHACL, overcoming the limitations of XML-based tools through an updated ontology and automated mapping to RDF triples.

Authors:Yu Qian, Xianmin Huang, Ranran Wang, Zeyu Yang, Min Zhou, Thomas Kämpfe, Cheng Zhuo, Xunzhao Yin
Title: Device-Algorithm Co-Design of Ferroelectric Compute-in-Memory In-Situ Annealer for Combinatorial Optimization Problems
Abstract:
Combinatorial optimization problems (COPs) are crucial in many applications but are computationally demanding. Traditional Ising annealers address COPs by directly converting them into Ising models (known as direct-E transformation) and solving them through iterative annealing. However, these approaches require vector-matrix-vector (VMV) multiplications with a complexity of $O(n^2)$ for Ising energy computation and complex exponential annealing factor calculations during the annealing process, thus significantly increasing hardware costs. In this work, we propose a ferroelectric compute-in-memory (CiM) in-situ annealer to overcome the aforementioned challenges. The proposed device-algorithm co-design framework consists of (i) a novel transformation method (the first to our knowledge) that converts COPs into an innovative incremental-E form, which reduces the complexity of VMV multiplication from $O(n^2)$ to $O(n)$, and approximates the exponential annealing factor with a much simplified fractional form; (ii) a double gate ferroelectric FET (DG FeFET)-based CiM crossbar that efficiently computes the in-situ incremental-E form by leveraging the unique structure of DG FeFETs; (iii) a CiM annealer that approaches the solutions of COPs via iterative incremental-E computations within a tunable back gate-based in-situ annealing flow. Evaluation results show that our proposed CiM annealer significantly reduces hardware overhead, reducing energy consumption by 1503/1716$\times$ and time cost by 8.08/8.15$\times$ in solving 3000-node Max-Cut problems compared to two state-of-the-art annealers. It also exhibits high solving efficiency, achieving a remarkable average success rate of 98\%, whereas other annealers show only 50\% given the same iteration counts.
中文: 本研究提出了一种铁电存内计算退火器,将组合优化问题转化为增量能量形式,显著降低了计算复杂度和硬件成本,并在解决大规模问题时表现出高效性。
English: This study introduces a ferroelectric compute-in-memory annealer that transforms combinatorial optimization problems into an incremental-E form, reducing computational complexity and hardware costs while achieving high efficiency in solving large-scale problems.
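
The drop from $O(n^2)$ to $O(n)$ per step can be illustrated with the standard single-flip identity for Max-Cut. This is only a sketch of the general idea of incremental evaluation, not the authors' incremental-E transformation or their hardware mapping; the function names and the small example graph are assumptions. With spins $s_i \in \{-1, +1\}$ and symmetric weights $w_{ij}$, flipping node $k$ changes the cut by $s_k \sum_{j \neq k} w_{kj} s_j$, which needs only one row of the weight matrix.

```python
def cut_value(w, s):
    """Full cut value: sums over all node pairs, O(n^2)."""
    n = len(s)
    return sum(
        w[i][j] * (1 - s[i] * s[j]) / 2
        for i in range(n)
        for j in range(i + 1, n)
    )

def cut_gain(w, s, k):
    """Change in cut value if node k is flipped: one row of w, O(n)."""
    return s[k] * sum(w[k][j] * s[j] for j in range(len(s)) if j != k)

# Hypothetical 4-node weighted graph (symmetric, zero diagonal).
w = [
    [0, 3, 1, 0],
    [3, 0, 2, 4],
    [1, 2, 0, 5],
    [0, 4, 5, 0],
]
s = [1, 1, -1, 1]

k = 1
gain = cut_gain(w, s, k)
flipped = s[:k] + [-s[k]] + s[k + 1:]
```

An annealer that tracks the current energy can therefore accept or reject a proposed flip using `cut_gain` alone, instead of recomputing `cut_value` from scratch at every iteration.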

Authors:Jianbo Gao, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Title: AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection
Abstract:
Large-scale Artificial Intelligence (AI) models offering multimodal services have become foundational in AI systems, making them prime targets for model theft. Existing methods select Out-of-Distribution (OoD) data as backdoor watermarks and retrain the original model for copyright protection. However, existing methods are susceptible to malicious detection and forgery by adversaries, resulting in watermark evasion. In this work, we propose the Model-\underline{ag}nostic Black-box Backdoor W\underline{ate}rmarking Framework (AGATE) to address stealthiness and robustness challenges in multimodal model copyright protection. Specifically, we propose an adversarial trigger generation method to generate stealthy adversarial triggers from ordinary datasets, providing visual fidelity while inducing semantic shifts. To alleviate the issue of anomaly detection among model outputs, we propose a post-transform module to correct the model output by narrowing the distance between adversarial trigger image embedding and text embedding. Subsequently, a two-phase watermark verification is proposed to judge whether the current model infringes by comparing the two results with and without the transform module. Consequently, we consistently outperform state-of-the-art methods across five datasets in the downstream tasks of multimodal image-text retrieval and image classification. Additionally, we validated the robustness of AGATE under two adversarial attack scenarios.
中文: 本文提出AGATE框架,通过生成隐蔽对抗性触发器和后处理转换模块,解决多模态AI版权保护中的隐蔽性与鲁棒性挑战,在图像-文本检索和图像分类任务中优于现有方法,并在对抗攻击场景下验证了其稳健性。
English: This paper introduces AGATE, a model-agnostic black-box backdoor watermarking framework that generates stealthy adversarial triggers and employs a post-transform module to enhance robustness and prevent evasion in multimodal AI copyright protection, outperforming existing methods across multiple datasets and attack scenarios.

Authors:Chenyi Sun, Ziting Zhang, Kai Wan, Giuseppe Caire
Title: Multi-Message Secure Aggregation with Demand Privacy
Abstract:
This paper considers a multi-message secure aggregation problem with demand privacy, in which a server aims to compute $\sf K_c\geq 1$ linear combinations of local inputs from $\sf K$ distributed users. The problem addresses two tasks: (1) security, ensuring that the server can only obtain the desired linear combinations without any other information about the users' inputs, and (2) privacy, preventing users from learning about the server's computation task. In addition, the effect of user dropouts is considered, where at most $\sf{K-U}$ users can drop out and the identity of these users cannot be predicted in advance. We propose two schemes for $\sf K_c = 1$ and $\sf 2\leq K_c\leq U-1$, respectively. For $\sf K_c = 1$, we introduce multiplicative encryption of the server's demand using a random variable, where users share coded keys offline and transmit masked models in the first round, followed by aggregated coded keys in the second round for task recovery. For $\sf{2\leq K_c \leq U-1}$, we use robust symmetric private computation to recover linear combinations of keys in the second round. The objective is to minimize the number of symbols sent by each user during the two rounds. Our proposed schemes achieve the optimal rate region when $\sf K_c = 1$ and the order optimal rate (within 2) when $\sf{2\leq K_c \leq U-1}$.
中文: 本文针对分布式用户输入的安全聚合问题,提出了在保护服务器安全与用户隐私的同时计算线性组合的方案,并在不同计算数量下实现了最优通信效率。
English: This paper proposes secure aggregation schemes for computing linear combinations of distributed user inputs while ensuring server security and user privacy, achieving optimal communication rates for different numbers of computations.
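
The security requirement (the server learns the aggregate and nothing else) can be illustrated with textbook additive masking over a finite field. This is a deliberately simplified sketch and not the scheme proposed in the paper: it ignores user dropouts, demand privacy, and coded key sharing, and the field size `P` and all function names are assumptions.

```python
import random

P = 2**31 - 1  # prime field size (an assumption for this sketch)

def make_zero_sum_keys(num_users, rng):
    """Generate one key per user such that all keys sum to 0 mod P."""
    keys = [rng.randrange(P) for _ in range(num_users - 1)]
    keys.append((-sum(keys)) % P)
    return keys

def mask(inputs, keys):
    """Each user adds its own key to its input before uploading."""
    return [(x + z) % P for x, z in zip(inputs, keys)]

def aggregate(masked):
    """The server sums the uploads; the keys cancel, leaving the sum of inputs."""
    return sum(masked) % P

rng = random.Random(0)
inputs = [3, 5, 7]
keys = make_zero_sum_keys(len(inputs), rng)
uploads = mask(inputs, keys)
total = aggregate(uploads)
```

Each individual upload is uniformly distributed from the server's point of view, yet the sum is exact. Handling dropouts is what makes the real problem harder: if one user disappears, the remaining keys no longer cancel, which is why the schemes in the abstract use a second round of aggregated coded keys.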

Authors:Qitao Tan, Sung-En Chang, Rui Xia, Huidong Ji, Chence Yang, Ci Zhang, Jun Liu, Zheng Zhan, Zhenman Fang, Zhou Zou, Yanzhi Wang, Jin Lu, Geng Yuan
Title: Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training
Abstract:
Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings. However, this seemingly promising approach faces a significant and long-ignored challenge. ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties and even makes it infeasible for hardware platforms, such as FPGAs and ASICs. In this paper, we identify this critical issue, which arises from the mismatch between algorithm and hardware designers. To address this issue, we propose PeZO, a perturbation-efficient ZO framework. Specifically, we design random number reuse strategies to significantly reduce the demand for random number generation and introduce a hardware-friendly adaptive scaling method to replace the costly Gaussian distribution with a uniform distribution. Our experiments show that PeZO reduces the required LUTs and FFs for random number generation by 48.6\% and 12.7\%, and saves up to 86\% of power consumption, all without compromising training performance, making ZO optimization feasible for on-device training. To the best of our knowledge, we are the first to explore the potential of on-device ZO optimization, providing valuable insights for future research.
中文: PeZO框架通过减少随机数生成需求和用均匀分布替代高斯分布,解决了零阶优化在硬件上的效率问题,使设备端训练变得可行且不损失性能。
English: The PeZO framework addresses the hardware inefficiency of zeroth-order optimization by reducing random number generation demands and replacing Gaussian with uniform distributions, enabling feasible on-device training without performance loss.
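
For readers unfamiliar with ZO training, the following sketch shows a generic two-point zeroth-order gradient estimate with a uniform perturbation in place of a Gaussian one. It is not PeZO itself: the abstract does not specify the adaptive scaling rule or the reuse strategy, so the fixed scaling to unit variance, the function names, and the toy loss are all assumptions.

```python
import math
import random

def uniform_perturbation(dim, rng):
    """Draw a perturbation from U(-sqrt(3), sqrt(3)), which has unit variance."""
    bound = math.sqrt(3.0)
    return [rng.uniform(-bound, bound) for _ in range(dim)]

def zo_gradient(loss, x, u, eps=1e-3):
    """Two-point ZO estimate: finite-difference slope along u, scaled by u.

    Only two forward evaluations of the loss are needed; no backpropagation.
    """
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    slope = (loss(x_plus) - loss(x_minus)) / (2.0 * eps)
    return [slope * ui for ui in u]

def squared_norm(v):
    """Toy loss used only for this example."""
    return sum(t * t for t in v)

rng = random.Random(0)
x = [1.0, 2.0]
u = uniform_perturbation(len(x), rng)
g = zo_gradient(squared_norm, x, u)
```

The hardware cost discussed in the abstract comes from generating one random number per parameter per step; a uniform generator is much cheaper to build than a Gaussian one, and reusing previously drawn numbers reduces the count further.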

Authors:Niki van Stein, Anna V. Kononova, Haoran Yin, Thomas Bäck
Title: BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics
Abstract:
The application of Large Language Models (LLMs) for Automated Algorithm Discovery (AAD), particularly for optimisation heuristics, is an emerging field of research. This emergence necessitates robust, standardised benchmarking practices to rigorously evaluate the capabilities and limitations of LLM-driven AAD methods and the resulting generated algorithms, especially given the opacity of their design process and known issues with existing benchmarks. To address this need, we introduce BLADE (Benchmark suite for LLM-driven Automated Design and Evolution), a modular and extensible framework specifically designed for benchmarking LLM-driven AAD methods in a continuous black-box optimisation context. BLADE integrates collections of benchmark problems (including MA-BBOB and SBOX-COST among others) with instance generators and textual descriptions aimed at capability-focused testing, such as generalisation, specialisation and information exploitation. It offers flexible experimental setup options, standardised logging for reproducibility and fair comparison, incorporates methods for analysing the AAD process (e.g., Code Evolution Graphs and various visualisation approaches) and facilitates comparison against human-designed baselines through integration with established tools like IOHanalyser and IOHexplainer. BLADE provides an `out-of-the-box' solution to systematically evaluate LLM-driven AAD approaches. The framework is demonstrated through two distinct use cases exploring mutation prompt strategies and function specialisation.
Chinese: BLADE作为一个模块化、可扩展的基准测试框架,旨在系统评估大语言模型驱动的自动化算法发现方法,通过集成基准问题与实验工具,解决当前缺乏标准化评估体系的问题。
English: BLADE is introduced as a modular and extensible benchmarking framework designed to systematically evaluate Large Language Model-driven Automated Algorithm Discovery methods in continuous black-box optimization, addressing the need for standardized assessment of their capabilities and limitations.

Authors:Chengzhi Wu, Yuxin Wan, Hao Fu, Julius Pfrommer, Zeyun Zhong, Junwei Zheng, Jiaming Zhang, Jürgen Beyerer
Title: SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
Abstract:
Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either lead to unrecognizable sampling patterns by generating a new point cloud or biased sampled results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.
中文: 点云采样研究提出SAMBLE方法,通过学习形状特定的采样策略,有效平衡局部细节与全局均匀性,在多种下游任务中实现优越性能,即使采样点稀少。
English: Point cloud sampling is advancing with learning-based methods like SAMBLE, which adapts to shape-specific distributions to balance local detail and global uniformity, enhancing performance in downstream tasks even with sparse sampling.

Authors:Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu
Title: Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
Abstract:
In this report, we propose Triton-distributed, an extension of the existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing good coverage of existing optimizations from different frameworks. First, we integrate communication primitives compliant with the OpenSHMEM standard into the compiler. This enables programmers to utilize these primitives with a higher-level Python programming model. Second, we illustrate how to achieve complex joint optimization of computation, memory access, and communication with the assistance of the compiler. In particular, we show how to use overlapping techniques to hide latency and present our compiler-based programming methods in both single-node and multi-node scenarios. Finally, we showcase the performance of the code generated by our compiler. In a test environment with up to 64 devices, our compiler can fully utilize heterogeneous communication and computation resources to provide effective overlapping and high performance. In many cases, the performance of the generated code can even outperform hand-optimized code. Moreover, the development difficulty and the time cost for development using our compiler are far less than those of low-level programming such as CUDA/C++, which clearly demonstrates significant productivity advantages.
中文: Triton-distributed 是一个编译器扩展,支持分布式AI工作负载的原生重叠优化,通过集成OpenSHMEM通信原语和Python编程模型,在性能和开发效率上显著优于底层编程方法。
English: Triton-distributed is a compiler extension that enables native overlapping optimizations for distributed AI workloads, integrating OpenSHMEM communication primitives with a Python programming model to achieve high performance and productivity advantages over low-level programming.

Authors:Kangbiao Shi, Yixu Feng, Tao Hu, Yu Cao, Peng Wu, Yijin Liang, Yanning Zhang, Qingsen Yan
Title: FusionNet: Multi-model Linear Fusion Framework for Low-light Image Enhancement
Abstract:
The advent of Deep Neural Networks (DNNs) has driven remarkable progress in low-light image enhancement (LLIE), with diverse architectures (e.g., CNNs and Transformers) and color spaces (e.g., sRGB, HSV, HVI) yielding impressive results. Recent efforts have sought to leverage the complementary strengths of these paradigms, offering promising solutions to enhance performance across varying degradation scenarios. However, existing fusion strategies are hindered by challenges such as parameter explosion, optimization instability, and feature misalignment, limiting further improvements. To overcome these issues, we introduce FusionNet, a novel multi-model linear fusion framework that operates in parallel to effectively capture global and local features across diverse color spaces. By incorporating a linear fusion strategy underpinned by Hilbert space theoretical guarantees, FusionNet mitigates network collapse and reduces excessive training costs. Our method achieved 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both quantitative and qualitative results, delivering robust enhancement under diverse low-light conditions.
中文: FusionNet提出了一种多模型线性融合框架,能有效整合不同色彩空间的全局与局部特征,在低光图像增强中取得领先性能,同时解决了参数爆炸和优化不稳定等问题。
English: FusionNet introduces a multi-model linear fusion framework that effectively integrates global and local features across color spaces, achieving top performance in low-light image enhancement while overcoming issues like parameter explosion and optimization instability.

Authors:Deeksha Varshney, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Title: ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics
Abstract:
Accurate assessments of extreme weather events are vital for research and policy, yet localized and granular data remain scarce in many parts of the world. This data gap limits our ability to analyze potential outcomes and implications of extreme weather events, hindering effective decision-making. Large Language Models (LLMs) can process vast amounts of unstructured text data, extract meaningful insights, and generate detailed assessments by synthesizing information from multiple sources. Furthermore, LLMs can seamlessly transfer their general language understanding to smaller models, enabling these models to retain key knowledge while being fine-tuned for specific tasks. In this paper, we propose Extreme Weather Reasoning-Aware Alignment (EWRA), a method that enhances small language models (SLMs) by incorporating structured reasoning paths derived from LLMs, and ExtremeWeatherNews, a large dataset of extreme weather event-related news articles. EWRA and ExtremeWeatherNews together form the overall framework, ClimaEmpact, that focuses on addressing three critical extreme-weather tasks: categorization of tangible vulnerabilities/impacts, topic labeling, and emotion analysis. By aligning SLMs with advanced reasoning strategies on ExtremeWeatherNews (and its derived dataset ExtremeAlign used specifically for SLM alignment), EWRA improves the SLMs' ability to generate well-grounded and domain-specific responses for extreme weather analytics. Our results show that the approach proposed guides SLMs to output domain-aligned responses, surpassing the performance of task-specific models and offering enhanced real-world applicability for extreme weather analytics.
中文:通过结构化推理和专门数据集,大语言模型能提升小语言模型在极端天气分析中的表现,其性能超越特定任务模型并具有更强的实际应用价值。
English: LLMs can enhance small language models through structured reasoning and a specialized dataset to improve extreme weather analytics, surpassing task-specific models in performance and real-world applicability.

Authors:Feng Chen, Yefei He, Lequan Lin, Jing Liu, Bohan Zhuang, Qi Wu
Title: ZipR1: Reinforcing Token Sparsity in MLLMs
Abstract:
Sparse attention mechanisms aim to reduce computational overhead by selectively processing a subset of salient tokens while preserving model performance. Despite the effectiveness of such designs, how to actively encourage token sparsity of well-posed MLLMs remains under-explored, which fundamentally limits the achievable acceleration effect during inference. In this paper, we propose a simple RL-based post-training method named \textbf{ZipR1} that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. In this way, our method can jointly alleviate the computation and memory bottlenecks via directly optimizing the inference-consistent efficiency-performance tradeoff. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80\% to 25\% with a minimal accuracy reduction on 13 image and video benchmarks.
Chinese: ZipR1是一种简单的基于强化学习的后训练方法,以令牌削减率作为效率奖励、以回答准确率作为性能奖励,在13个图像和视频基准上将Qwen2/2.5-VL的令牌比例从80%降至25%,且精度损失极小。
English: ZipR1 is a simple RL-based post-training method for multimodal large language models that uses the token reduction ratio as an efficiency reward and answer accuracy as a performance reward, reducing the token ratio of Qwen2/2.5-VL from 80% to 25% with minimal accuracy loss on 13 image and video benchmarks.

Authors:Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu
Title: Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Abstract:
Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50\% token reduction), with little headroom to push budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named \textit{Sparsity Forcing}. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20\% to 75\% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
Chinese: Sparsity Forcing框架通过强化学习增强多模态大语言模型的令牌稀疏性,在保持精度的同时将令牌削减率提升至75%,显著提高了推理速度并降低了内存消耗。
English: The Sparsity Forcing framework enhances token sparsity in multimodal large language models using reinforcement learning to optimize efficiency and accuracy, achieving up to 75% token reduction with minimal performance loss and significant improvements in inference speed and memory usage.
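A minimal sketch of how a joint efficiency-and-correctness reward could be contrasted within a group of rollouts, in the spirit of the Sparsity Forcing abstract. The reward shape, the `efficiency_weight` parameter, and the mean/std normalization are illustrative assumptions; the abstract does not specify them.

```python
from statistics import mean, pstdev


def joint_rewards(rollouts, efficiency_weight=0.5):
    """Score each rollout given (token_reduction_ratio, is_correct).

    Assumption: an incorrect answer earns nothing, and a correct answer
    earns a bonus that grows with the fraction of tokens it dropped.
    """
    return [
        (1.0 + efficiency_weight * ratio) if correct else 0.0
        for ratio, correct in rollouts
    ]


def group_advantages(rewards, eps=1e-8):
    """Contrast rollouts within one group by centering and scaling rewards.

    Assumption: a group-relative normalization stands in for whatever
    contrast the paper actually uses.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group: same prompt, three different token budgets.
    group = [(0.75, True), (0.20, True), (0.75, False)]
    print(group_advantages(joint_rewards(group)))
```

With this shaping, the correct rollout that drops the most tokens gets the largest advantage, the correct but less sparse rollout gets a smaller one, and the incorrect rollout is pushed down regardless of how many tokens it saved.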

Authors:Atsushi Kanehira, Naoki Wake, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
Title: RL-Driven Data Generation for Robust Vision-Based Dexterous Grasping
Abstract:
This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontrivial across diverse object shapes. To address this, we leverage RL to generate contact-rich grasping data across varied geometries. In line with the real-to-sim-to-real paradigm, the grasp skill is formulated as a parameterized and tunable reference trajectory refined by a residual policy learned via RL. This modular design enables trajectory-level control that is both consistent with real demonstrations and adaptable to diverse object geometries. A vision-conditioned policy trained on simulation-augmented data demonstrates strong generalization to unseen objects, highlighting the potential of our approach to alleviate the data bottleneck in training VA models.
中文: 本研究提出了一种基于强化学习的数据增强方法,通过生成丰富的接触模拟数据来改进灵巧抓取的视觉-动作模型,从而提升对未知物体的泛化能力并缓解数据不足问题。
English: This study introduces a reinforcement learning-based data augmentation method to enhance vision-action models for dexterous grasping by generating contact-rich simulation data, improving generalization to unseen objects and addressing data scarcity.

Authors:Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai
Title: Representation Learning via Non-Contrastive Mutual Information
Abstract:
Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
中文: 本文提出了一种名为互信息非对比(MINC)损失的新自监督学习目标,它通过融合对比和非对比方法的优势,在降低方差的同时防止模型坍塌,并在ImageNet数据集上展现出优于基线方法的性能。
English: This paper introduces the Mutual Information Non-Contrastive (MINC) loss, a novel self-supervised learning objective that combines the strengths of contrastive and non-contrastive methods by reducing variance while preventing model collapse, demonstrating improved performance over baseline methods on ImageNet.

Authors:Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly
Title: Target Concrete Score Matching: A Holistic Framework for Discrete Diffusion
Abstract:
Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including fine-tuning using reward functions or preference data, and distillation of knowledge from pre-trained autoregressive models. These new capabilities stem from the core idea of TCSM, estimating the concrete score of the target distribution, which resides in the original (clean) data space. This allows seamless integration with reward functions and pre-trained models, which inherently only operate in the clean data space rather than the noisy intermediate spaces of diffusion processes. Our experiments on language modeling tasks demonstrate that TCSM matches or surpasses current methods. Additionally, TCSM is versatile, applicable to both pre-training and post-training scenarios, offering greater flexibility and sample efficiency.
Chinese: 目标具体分数匹配(TCSM)是一种用于训练和微调离散扩散模型的通用目标,能够与奖励函数和预训练模型无缝集成,并在语言建模任务中展现出卓越的性能和灵活性。
English: Target Concrete Score Matching (TCSM) is a versatile objective for training and fine-tuning discrete diffusion models, enabling seamless integration with reward functions and pre-trained models while demonstrating superior performance and flexibility in language modeling tasks.

Authors:Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Title: Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Abstract:
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
Chinese: 检索增强生成(RAG)通过引用源文档减少大语言模型的幻觉,一项对比GPT-4o与人类评估者的研究表明,自动支持评估可作为可靠替代方案,在后编辑条件下一致性高达72%。
English: Retrieval-augmented generation (RAG) enhances large language models by allowing them to cite source documents to reduce hallucinations, and a study comparing GPT-4o with human judges found that automated support assessment can be a reliable alternative, achieving up to 72% agreement in post-editing conditions.

Authors:Yulong Li, Zhixiang Lu, Feilong Tang, Simin Lai, Ming Hu, Yuxuan Zhang, Haochen Xue, Zhaodong Wu, Imran Razzak, Qingxia Li, Jionglong Su
Title: Rhythm of Opinion: A Hawkes-Graph Framework for Dynamic Propagation Analysis
Abstract:
The rapid development of social media has significantly reshaped the dynamics of public opinion, resulting in complex interactions that traditional models fail to effectively capture. To address this challenge, we propose an innovative approach that integrates multi-dimensional Hawkes processes with Graph Neural Network, modeling opinion propagation dynamics among nodes in a social network while considering the intricate hierarchical relationships between comments. The extended multi-dimensional Hawkes process captures the hierarchical structure, multi-dimensional interactions, and mutual influences across different topics, forming a complex propagation network. Moreover, recognizing the lack of high-quality datasets capable of comprehensively capturing the evolution of public opinion dynamics, we introduce a new dataset, VISTA. It includes 159 trending topics, corresponding to 47,207 posts, 327,015 second-level comments, and 29,578 third-level comments, covering diverse domains such as politics, entertainment, sports, health, and medicine. The dataset is annotated with detailed sentiment labels across 11 categories and clearly defined hierarchical relationships. When combined with our method, it offers strong interpretability by linking sentiment propagation to the comment hierarchy and temporal evolution. Our approach provides a robust baseline for future research.
Chinese: 本研究提出了一种创新模型,将多维霍克斯过程与图神经网络相结合,以捕捉社交网络中复杂的舆论动态,并辅以新开发的VISTA数据集,该数据集提供了全面的情感标注和层级关系标注。
English: This study introduces an innovative model combining multi-dimensional Hawkes processes with Graph Neural Networks to capture complex public opinion dynamics in social networks, supported by a newly developed dataset called VISTA that provides comprehensive sentiment and hierarchical annotations.

Authors:Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Title: The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Abstract:
Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
中文: 本研究提出了一种基于TREC评估方法的自动框架,通过自动化生成和分配信息要点来评估检索增强生成系统,在保证与人工评估高度一致的同时实现了效率与质量的平衡。
English: This work introduces an automatic evaluation framework for retrieval-augmented generation (RAG) systems, leveraging the nugget methodology from TREC to achieve strong agreement with human-based evaluations while offering practical tradeoffs between effort and quality.

Authors:Gabriela Ben Melech Stan, Estelle Aflalo, Avinash Madasu, Vasudev Lal, Phillip Howard
Title: Learning from Reasoning Failures via Synthetic Data Generation
Abstract:
Training models on synthetic data has emerged as an increasingly important strategy for improving the performance of generative AI. This approach is particularly helpful for large multimodal models (LMMs) due to the relative scarcity of high-quality paired image-text data compared to language-only data. While a variety of methods have been proposed for generating large multimodal datasets, they do not tailor the synthetic data to address specific deficiencies in the reasoning abilities of LMMs which will be trained with the generated dataset. In contrast, humans often learn in a more efficient manner by seeking out examples related to the types of reasoning where they have failed previously. Inspired by this observation, we propose a new approach for synthetic data generation which is grounded in the analysis of an existing LMM's reasoning failures. Our methodology leverages frontier models to automatically analyze errors produced by a weaker LMM and propose new examples which can be used to correct the reasoning failure via additional training, which are then further filtered to ensure high quality. We generate a large multimodal instruction tuning dataset containing over 553k examples using our approach and conduct extensive experiments demonstrating its utility for improving the performance of LMMs on multiple downstream tasks. Our results show that models trained on our synthetic data can even exceed the performance of LMMs trained on an equivalent amount of additional real data, demonstrating the high value of generating synthetic data targeted to specific reasoning failure modes in LMMs. We will make our dataset and code publicly available.
中文: 本文提出了一种通过分析现有大型多模态模型的推理失败来生成针对性合成数据的新方法,有效提升了模型性能,其效果甚至优于使用等量真实数据的训练结果。
English: This paper introduces a novel approach to generating synthetic multimodal data by analyzing reasoning failures in existing large multimodal models, which effectively enhances model performance and even surpasses training with equivalent real data.

Authors:Megan Gu, Chloe Qianhui Zhao, Claire Liu, Nikhil Patel, Jahnvi Shah, Jionghao Lin, Kenneth R. Koedinger
Title: Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation
Abstract:
Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy \textit{helping students manage inequity} showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.
中文: 本研究利用GPT-3.5自动评估五种关键辅导策略,结果显示模型虽能有效排除错误分类,但在准确识别正确策略方面一致性不足,其中帮助学生应对不平等策略表现最佳。
English: Our study employs GPT-3.5 to automatically evaluate five key tutoring strategies, showing strong capability in avoiding incorrect classifications but limited consistency in accurately identifying correct strategies, with the highest performance observed in helping students manage inequity.
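For readers less familiar with the two metrics reported in this abstract, the sketch below shows how True Negative Rate and Recall are computed from confusion-matrix counts. The function names and the counts in the usage example are hypothetical and are not taken from the study.

```python
def true_negative_rate(tn: int, fp: int) -> float:
    """TNR (specificity): share of actual negatives labeled negative."""
    if tn + fp == 0:
        raise ValueError("no negative examples")
    return tn / (tn + fp)


def recall(tp: int, fn: int) -> float:
    """Recall (sensitivity): share of actual positives labeled positive."""
    if tp + fn == 0:
        raise ValueError("no positive examples")
    return tp / (tp + fn)


if __name__ == "__main__":
    # Hypothetical counts for one tutoring strategy.
    print(true_negative_rate(tn=70, fp=30))
    print(recall(tp=40, fn=60))
```

A classifier with a high TNR but a low Recall, as reported here, rarely flags a negative case as positive but misses many of the true positive cases.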

Authors:Ali Agha, Kyohei Otsu, Benjamin Morrell, David D. Fan, Sung-Kyun Kim, Muhammad Fadhil Ginting, Xianmei Lei, Jeffrey Edlund, Seyed Fakoorian, Amanda Bouman, Fernando Chavez, Taeyeon Kim, Gustavo J. Correa, Maira Saboia, Angel Santamaria-Navarro, Brett Lopez, Boseong Kim, Chanyoung Jung, Mamoru Sobue, Oriana Claudia Peltzer, Joshua Ott, Robert Trybula, Thomas Touma, Marcel Kaufmann, Tiago Stegun Vaquero, Torkom Pailevanian, Matteo Palieri, Yun Chang, Andrzej Reinke, Matthew Anderson, Frederik E. T. Schöller, Patrick Spieler, Lillian M. Clark, Avak Archanian, Kenny Chen, Hovhannes Melikyan, Anushri Dixit, Harrison Delecki, Daniel Pastor, Barry Ridge, Nicolas Marchal, Jose Uribe, Sharmita Dey, Kamak Ebadi, Kyle Coble, Alexander Nikitas Dimopoulos, Vivek Thangavelu, Vivek S. Varadharajan, Nicholas Palomo, Antoni Rosinol, Arghya Chatterjee, Christoforos Kanellakis, Bjorn Lindqvist, Micah Corah, Kyle Strickland, Ryan Stonebraker, Michael Milano, Christopher E. Denniston, Sami Sahnoune, Thomas Claudet, Seungwook Lee, Gautam Salhotra, Edward Terry, Rithvik Musuku, Robin Schmid, Tony Tran, Ara Kourchians, Justin Schachter, Hector Azpurua, Levi Resende, Arash Kalantari, Jeremy Nash, Josh Lee, Christopher Patterson, Jennifer G. Blank, Kartik Patath, Yuki Kubo, Ryan Alimo, Yasin Almalioglu, Aaron Curtis, Jacqueline Sly, Tesla Wells, Nhut T. Ho, Mykel Kochenderfer, Giovanni Beltrame, George Nikolakopoulos, David Shim, Luca Carlone, Joel Burdick
Title: An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments
Abstract:
This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.
中文: 本文扩展了NeBula自主系统框架,通过硬件、软件和算法增强提升了大规模地下环境的探索能力,并在DARPA地下挑战赛中进行了验证。
English: This paper extends the NeBula autonomy framework with hardware, software, and algorithmic enhancements to improve exploration capabilities in large-scale underground environments, as demonstrated in the DARPA Subterranean Challenge.

Authors:Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
Title: FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
Abstract:
We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
中文:FreshStack是一个综合性框架,基于代码和技术文档自动构建具有挑战性的信息检索评估基准,利用社区问答生成信息要点并融合多种检索技术查找支持文档;现有检索模型在全部五个主题上均显著落后于oracle方法,表明信息检索和RAG系统仍有很大改进空间。
English: FreshStack is a holistic framework that automatically builds challenging IR evaluation benchmarks from code and technical documentation, generating nuggets from community-asked questions and answers and using a fusion of retrieval techniques to find supporting documents; existing retrieval models significantly underperform oracle approaches on all five topics, leaving ample headroom for improving IR and RAG systems.

Authors:Mingwei Li, Pu Pang, Hehe Fan, Hua Huang, Yi Yang
Title: TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors
Abstract:
Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset are available at https://longxiang-ai.github.io/TSGS/.
Chinese: TSGS提出了一种双阶段框架,将几何学习与外观建模分离,解决了3D高斯溅射中的透明-深度困境,在TransLab数据集上验证了其能同时实现透明表面的精确几何重建和逼真渲染。
English: TSGS introduces a dual-stage framework that separates geometry learning from appearance modeling to overcome the transparency-depth dilemma in 3D Gaussian Splatting, achieving superior geometric accuracy and photorealistic rendering for transparent surfaces as validated on the TransLab dataset.
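The sliding-window depth extraction mentioned in the abstract can be illustrated with a small per-ray sketch. Everything below is an assumption made for illustration: the window size, the rule of picking the window with the largest summed blending weight (earliest window on ties), and the function name `first_surface_depth` are not taken from the TSGS code.

```python
def first_surface_depth(depths, weights, window=3):
    """Estimate one ray's surface depth from per-sample blending weights.

    depths  : sample depths along the ray, sorted front to back
    weights : alpha-blending weight of each sample (same length)
    Returns the weighted mean depth inside the chosen window, or None
    if every weight is zero.
    """
    if len(depths) != len(weights) or not depths:
        raise ValueError("depths and weights must be equal-length, non-empty")
    window = min(window, len(depths))

    best_start, best_sum = 0, -1.0
    for start in range(len(weights) - window + 1):
        window_sum = sum(weights[start:start + window])
        # Strict '>' keeps the earliest window when sums tie.
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum

    if best_sum <= 0.0:
        return None
    pairs = zip(depths[best_start:best_start + window],
                weights[best_start:best_start + window])
    return sum(d * w for d, w in pairs) / best_sum


if __name__ == "__main__":
    # Hypothetical ray: a glass surface near 1.1, background near 2.0.
    print(first_surface_depth([1.0, 1.1, 1.2, 2.0, 2.1],
                              [0.2, 0.5, 0.2, 0.05, 0.05]))
```

Restricting the average to one window keeps weight from samples behind the transparent surface from pulling the depth estimate backward, which is the failure mode a plain blended depth over the whole ray would show.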

Authors:Yutong Xia, Ao Qu, Yunhan Zheng, Yihong Tang, Dingyi Zhuang, Yuxuan Liang, Shenhao Wang, Cathy Wu, Lijun Sun, Roger Zimmermann, Jinhua Zhao
Title: Reimagining Urban Science: Scaling Causal Inference with Large Language Models
Abstract:
Urban causal research is essential for understanding the complex, dynamic processes that shape cities and for informing evidence-based policies. However, current practices are often constrained by inefficient and biased hypothesis formulation, challenges in integrating multimodal data, and fragile experimental methodologies. Imagine a system that automatically estimates the causal impact of congestion pricing on commute times by income group or measures how new green spaces affect asthma rates across neighborhoods using satellite imagery and health reports, and then generates comprehensive, policy-ready outputs, including causal estimates, subgroup analyses, and actionable recommendations. In this Perspective, we propose UrbanCIA, an LLM-driven conceptual framework composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy insights. We begin by examining the current landscape of urban causal research through a structured taxonomy of research topics, data sources, and methodological approaches, revealing systemic limitations across the workflow. Next, we introduce the design principles and technological roadmap for the four modules in the proposed framework. We also propose evaluation criteria to assess the rigor and transparency of these AI-augmented processes. Finally, we reflect on the broader implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces LLM-driven tools as catalysts for more scalable, reproducible, and inclusive urban research.
中文总结:UrbanCIA是一个基于大语言模型的框架,通过四个模块化智能体自动完成假设生成、数据整合、实验执行和政策解读,旨在克服当前城市因果研究的局限,推动城市研究向更可扩展和公平的方向发展。
English Summary: UrbanCIA is an LLM-driven framework designed to overcome current limitations in urban causal research by automating hypothesis generation, data integration, experiment execution, and policy interpretation through four modular agents, aiming to make urban studies more scalable and equitable.

Authors:Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos
Title: A Graph-Based Reinforcement Learning Approach with Frontier Potential Based Reward for Safe Cluttered Environment Exploration
Abstract:
Autonomous exploration of cluttered environments requires efficient exploration strategies that guarantee safety against potential collisions with unknown random obstacles. This paper presents a novel approach combining a graph neural network-based exploration greedy policy with a safety shield to ensure safe navigation goal selection. The network is trained using reinforcement learning and the proximal policy optimization algorithm to maximize exploration efficiency while reducing the safety shield interventions. However, if the policy selects an infeasible action, the safety shield intervenes to choose the best feasible alternative, ensuring system consistency. Moreover, this paper proposes a reward function that includes a potential field based on the agent's proximity to unexplored regions and the expected information gain from reaching them. Overall, the approach investigated in this paper merges the benefits of the adaptability of reinforcement learning-driven exploration policies and the guarantee ensured by explicit safety mechanisms. Extensive evaluations in simulated environments demonstrate that the approach enables efficient and safe exploration in cluttered environments.
中文: 本文提出了一种结合图神经网络探索策略与安全防护的新方法,通过强化学习训练,在杂乱环境中实现高效自主探索,并在策略选择不可行动作时由安全机制干预确保导航安全。
English: This paper introduces a reinforcement learning-based exploration strategy using a graph neural network and a safety shield to ensure efficient and safe navigation in cluttered environments by maximizing exploration while preventing collisions through intervention when necessary.
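The shield behaviour described in the abstract (keep the policy's greedy choice when it is feasible, otherwise fall back to the best feasible alternative) can be sketched as follows. The names `shielded_action`, `scores`, and `feasible` are hypothetical, and the feasibility check itself is assumed to be supplied by the caller.

```python
def shielded_action(scores, feasible):
    """Pick a navigation goal with a safety shield on top of a greedy policy.

    scores   : dict mapping each candidate action to its policy score
    feasible : set of actions the safety check has accepted
    Returns (action, intervened), where `intervened` is True when the
    shield had to override the policy's greedy choice.
    """
    if not scores:
        raise ValueError("no candidate actions")
    greedy = max(scores, key=scores.get)
    if greedy in feasible:
        return greedy, False

    safe_candidates = [a for a in scores if a in feasible]
    if not safe_candidates:
        raise ValueError("no feasible action available")
    # Best-scoring action among those the shield accepts.
    return max(safe_candidates, key=scores.get), True


if __name__ == "__main__":
    # Hypothetical frontier nodes scored by the policy network.
    policy_scores = {"frontier_a": 0.9, "frontier_b": 0.7, "frontier_c": 0.1}
    print(shielded_action(policy_scores, feasible={"frontier_b", "frontier_c"}))
```

Returning the `intervened` flag makes it easy to count shield interventions, the quantity the abstract says training tries to reduce.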

Authors:Keke Gai, Ziyue Shen, Jing Yu, Liehuang Zhu, Qi Wu
Title: PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility
Abstract:
With the growing demand for protecting the intellectual property (IP) of text-to-image diffusion models, we propose PCDiff, a proactive access control framework that redefines model authorization by regulating generation quality. At its core, PCDiff integrates a trainable fuser module and hierarchical authentication layers into the decoder architecture, ensuring that only users with valid encrypted credentials can generate high-fidelity images. In the absence of valid keys, the system deliberately degrades output quality, effectively preventing unauthorized exploitation. Importantly, while the primary mechanism enforces active access control through architectural intervention, its decoupled design retains compatibility with existing watermarking techniques. This satisfies the need of model owners to actively control model ownership while preserving the traceability capabilities provided by traditional watermarking approaches. Extensive experimental evaluations confirm a strong dependency between credential verification and image quality across various attack scenarios. Moreover, when combined with typical post-processing operations, PCDiff demonstrates powerful performance alongside conventional watermarking methods. This work shifts the paradigm from passive detection to proactive enforcement of authorization, laying the groundwork for IP management of diffusion models.
中文: PCDiff是一种主动访问控制框架,通过将基于凭证的认证集成到解码器中保护文本到图像扩散模型,确保仅授权用户能生成高质量图像,而对未授权访问则刻意降低输出质量。
English: PCDiff is a proactive access control framework that safeguards text-to-image diffusion models by integrating credential-based authentication into the decoder, ensuring only authorized users can produce high-quality images while deliberately degrading outputs for unauthorized access.

Authors:Chang Yang, Ruiyu Wang, Junzhe Jiang, Qi Jiang, Qinggang Zhang, Yanchen Deng, Shuxin Li, Shuyue Hu, Bo Li, Florian T. Pokorny, Xiao Huang, Xinrun Wang
Title: Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs
Abstract:
Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, current benchmarks have two main issues: i) they can be crushed in a short time (less than 1 year), and ii) they may be easily hacked. To handle these issues, we propose ever-scalingness as a principle for building benchmarks that are uncrushable, unhackable, auto-verifiable and general. This paper presents the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, NPPC has three main modules: i) npgym, which provides a unified interface to 25 well-known NP-complete problems and can generate any number of instances at any level of complexity, ii) npsolver, which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval, which provides comprehensive and ready-to-use tools to analyze the performance of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate that: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini on most of the NP-complete problems considered, and iii) the numbers of tokens and aha moments in advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then to decrease as the problem instances become more and more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as an uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
中文: 本文提出了非确定性多项式时间问题挑战(NPPC),作为一种持续扩展的基准测试,旨在不可破解且难以攻克,用于评估大语言模型的推理能力,实验表明它能将先进模型的性能显著降至10%以下。
English: This paper introduces the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling benchmark designed to be uncrushable and unhackable for evaluating large language models' reasoning capabilities, with experiments showing it significantly reduces advanced models' performance below 10%.
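Illustrative sketch (not from the paper): the abstract describes instances that can be generated at any size and checked automatically. The Python below shows that idea on 3-SAT, chosen here only as a familiar NP-complete problem. The function names `generate_3sat` and `verify_3sat` are hypothetical and are not the npgym API.

```python
import random


def generate_3sat(num_vars, num_clauses, seed=0):
    """Generate a random 3-SAT instance as a list of 3-literal clauses.

    A positive integer v means variable v, a negative integer -v means NOT v.
    Difficulty is scaled simply by raising num_vars and num_clauses.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)  # three distinct variables
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses


def verify_3sat(clauses, assignment):
    """Check a candidate assignment (dict: variable -> bool) in time linear in the instance size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


if __name__ == "__main__":
    instance = generate_3sat(num_vars=5, num_clauses=10, seed=1)
    candidate = {v: True for v in range(1, 6)}
    print(len(instance), verify_3sat(instance, candidate))
```

Generating instances may be cheap and solving them hard, but checking a proposed solution is always polynomial-time for problems in NP, which is what makes this kind of benchmark automatically verifiable.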

Authors:Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Title: MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique
Abstract:
Visual language models (VLMs) have demonstrated strong performance across diverse multimodal reasoning tasks but still face challenges such as hallucinations, resulting in incorrect reasoning outcomes. Inspired by recent research on external feedback mechanisms in large language models (LLMs), we propose a multimodal actor-critic framework to enhance VLM reasoning capabilities. Specifically, the actor model generates step-by-step reasoning paths based on image and text inputs, while the critic model evaluates these reasoning paths and provides corrective feedback. The actor model iteratively refines its reasoning based on the feedback until the reasoning outcome is deemed satisfactory by the critic model. To reduce reliance on costly manual annotations, we introduce an automated method for constructing multimodal critique datasets. By leveraging Monte Carlo Tree Search (MCTS), we systematically guide the actor model to explore diverse reasoning paths. To obtain critique data for correcting erroneous reasoning steps, we prompt an annotator model to compare pairs of reasoning paths diverging from a shared ancestor node - one leading to a correct conclusion and the other to an incorrect one. This approach enables us to construct the MMC (MCTS-based Multimodal Critique) dataset, upon which we further develop a comprehensive training and inference pipeline. Extensive experiments conducted on several public benchmark datasets and mainstream VLMs demonstrate that our approach significantly improves the performance of VLM on complex multimodal reasoning tasks, underscoring its effectiveness and wide applicability.
Chinese: 本研究提出了一种多模态演员-评论家框架,通过蒙特卡洛树搜索实现迭代反馈和自动数据集构建,显著提升了视觉语言模型在复杂推理任务中的性能表现。
English: This study introduces a multimodal actor-critic framework that enhances visual language model reasoning through iterative feedback and automated dataset construction using Monte Carlo Tree Search, significantly improving performance on complex tasks.
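Illustrative sketch (not from the paper): the actor-critic refinement loop described in the abstract can be written as a short control loop. Here `actor` and `critic` are hypothetical callables standing in for the two VLMs, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
def refine_with_critic(actor, critic, question, max_rounds=3):
    """Let the actor answer, then revise while the critic rejects the answer.

    actor(question, feedback) -> answer        (feedback is None on the first call)
    critic(question, answer)  -> (ok, feedback)
    """
    answer = actor(question, None)
    for _ in range(max_rounds):
        ok, feedback = critic(question, answer)
        if ok:
            break
        answer = actor(question, feedback)
    return answer


if __name__ == "__main__":
    # Toy stand-ins: the actor is wrong until it receives any feedback.
    toy_actor = lambda q, fb: "4" if fb else "5"
    toy_critic = lambda q, a: (a == "4", "recheck the addition")
    print(refine_with_critic(toy_actor, toy_critic, "2+2"))
```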

Authors:Yan Rong, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Title: Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
Abstract:
Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine-grained emotions, which often results in machine-like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training-free multi-agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human-like, and immersive audiobook generation and evaluation. Specifically, we firstly propose a flow-based, context-aware framework for diverse audio generation with word-level semantic and temporal alignment. To enhance expressiveness, we then design word-level paralinguistic augmentation, utterance-level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.
中文: Dopamine Audiobook是一种无需训练的多智能体系统,通过多模态大语言模型生成情感丰富、沉浸式的有声书,实现了精确的音频对齐,并采用创新评估框架确保与人类偏好一致。
English: Dopamine Audiobook is a training-free multi-agent system that uses a multimodal large language model to generate emotionally expressive and immersive audiobooks with precise audio alignment and an innovative evaluation framework aligned with human preferences.

Authors:Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, Ashwinee Panda
Title: Analysis of Attention in Video Diffusion Transformers
Abstract:
We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparsity: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.
中文摘要:本研究揭示了视频扩散变换器中注意力的三个关键特性:结构一致性可实现视频编辑、稀疏模式变化限制通用稀疏化、以及独特的注意力汇聚行为,为优化其效率与质量的平衡提供了新方向。
English Summary: This study reveals three key properties of attention mechanisms in video diffusion transformers—structural consistency enabling video editing, variable sparsity patterns limiting universal sparsification, and unique attention sink behaviors—providing insights to enhance their efficiency-quality trade-offs.

Authors:Pengfei Hu, Zhenrong Zhang, Qikai Chang, Shuhang Liu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma, Qingfeng Liu
Title: PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search
Abstract:
Recent work increasingly focuses on improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). Among existing methods, Process Reward Models (PRMs) stand out for offering dense, step-wise supervision to guide intermediate reasoning. However, how to effectively integrate PRMs into search strategies remains an open question. In this paper, we introduce PRM-BAS (PRM-Guided Beam Annealing Search), a lightweight approach for PRM-guided reasoning that dynamically adjusts beam size -- starting with a broader search space and gradually narrowing it as contextual information accumulates, thereby balancing performance and efficiency. We further propose a unified framework for data construction and PRM training. Specifically, we construct the PRM-BAS-300k dataset by selecting 300k questions from existing datasets and performing rollouts at each step to estimate the probability of reaching a correct final answer. The PRM is then trained using a combination of value loss for absolute action quality and rank loss for relative action quality. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate that PRM-BAS significantly improves reasoning performance while maintaining low computational cost. Moreover, it generalizes well across different model scales and architectures, showcasing strong robustness and plug-and-play capability.
中文: 本文提出了PRM-BAS方法,通过动态调整束宽来平衡多模态推理的性能与效率,并构建新数据集和结合价值与排序损失的训练框架,显著提升了推理效果。
English: This paper introduces PRM-BAS, a lightweight method that dynamically adjusts beam size during multimodal reasoning to enhance performance efficiently, supported by a new dataset and training approach combining value and rank losses.
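Illustrative sketch (not from the paper): a beam search whose beam starts wide and narrows at each step, which is the general idea the abstract describes. The linear shrink schedule, the parameter names, and the `score` callable (standing in for a process reward model) are all assumptions. The paper's actual annealing schedule is not given in the abstract.

```python
def beam_annealing_search(expand, score, start, num_steps,
                          initial_beam=4, min_beam=1, decay=1):
    """Beam search that shrinks the beam by `decay` after every step.

    expand(path) -> iterable of extended paths (candidate next reasoning steps)
    score(path)  -> float, higher is better (placeholder for a PRM score)
    """
    beams = [start]
    beam_size = initial_beam
    for _ in range(num_steps):
        candidates = [c for path in beams for c in expand(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_size]
        beam_size = max(min_beam, beam_size - decay)  # narrow the search over time
    return max(beams, key=score)


if __name__ == "__main__":
    # Toy problem: paths are tuples of digits, the score is their sum.
    expand = lambda path: [path + (d,) for d in (0, 1, 2)]
    print(beam_annealing_search(expand, sum, start=(), num_steps=3))
```

A wide early beam guards against pruning a good path before much context exists, while a narrow late beam saves scoring calls once the candidates are easier to tell apart.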

Authors:Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang
Title: Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Abstract:
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
中文摘要:Mavors是一种新颖的多粒度视频表示框架,通过结合高分辨率空间编码和时间连贯性建模,在保持计算效率的同时有效保留细粒度时空细节,显著提升了多模态大语言模型对长视频的理解能力。
English Summary: Mavors is a novel framework that enhances long-context video understanding in MLLMs by employing multi-granularity representations, combining high-resolution spatial encoding with temporal coherence modeling to preserve fine-grained spatio-temporal details while maintaining computational efficiency.

Authors:Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu
Title: OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Abstract:
Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. Under multisource preprocessing, two fundamental challenges exist. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present Omniload, an industrial-grade distributed data loading architecture for LFMs, with four innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for elastic multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. (4) Shadow loaders with differential checkpointing for fault recovery without workflow interruption. Deployed on production clusters scaling to multi-thousand GPUs, Omniload achieves: (1) 4.5x end-to-end training throughput improvement, (2) 13.5x reduction in CPU memory usage.
中文: Omniload是一种分布式数据加载架构,通过分离式预处理和集中式编排解决了大型基础模型训练中的负载不均和内存冗余问题,显著提升了训练吞吐量和内存使用效率。
English: Omniload is a distributed data loading architecture that addresses workload imbalance and memory redundancy in large foundation model training through disaggregated preprocessing and centralized orchestration, achieving significant throughput and memory efficiency improvements.
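Illustrative sketch (not from the paper): the abstract attributes workload imbalance to the quadratic cost of attention in sequence length. The Python below estimates per-sample cost as length squared and assigns samples to data-parallel ranks with a generic greedy heuristic (largest first, to the least-loaded rank). This is a textbook balancing heuristic used only to make the problem concrete. It is not the partitioning mechanism of the system described above, and the function name is hypothetical.

```python
import heapq


def balance_by_attention_cost(seq_lens, num_ranks):
    """Greedily assign sequences to ranks, using seq_len**2 as the cost estimate."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (accumulated cost, rank id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for n in sorted(seq_lens, reverse=True):
        load, rank = heapq.heappop(heap)  # currently least-loaded rank
        assignment[rank].append(n)
        heapq.heappush(heap, (load + n * n, rank))
    return assignment


if __name__ == "__main__":
    ranks = balance_by_attention_cost([8, 4, 4, 2, 2], num_ranks=2)
    for rank, lens in enumerate(ranks):
        print(rank, lens, sum(n * n for n in lens))
```

In this toy input one length-8 sample costs 64 units while the other four samples together cost 40, so an assignment that balances sample counts would still leave the ranks unevenly loaded.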

Authors:Xiang Hu, Pingping Zhang, Yuhao Wang, Bin Yan, Huchuan Lu
Title: SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification
Abstract:
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.
中文摘要:本文提出SD-ReID双阶段框架,利用稳定扩散模型生成跨视角特征,解决空中-地面行人重识别中视角变化难题,并通过视图优化解码器提升多视角特征表示能力。
English Summary: The paper introduces SD-ReID, a two-stage framework leveraging Stable Diffusion to generate view-specific features for aerial-ground person re-identification, addressing previous limitations in handling viewpoint variations and enhancing retrieval accuracy.

Authors:Zeyan Li, Jie Song, Tieying Zhang, Tao Yang, Xiongjun Ou, Yingjie Ye, Pengfei Duan, Muchen Lin, Jianjun Chen
Title: Adaptive and Efficient Log Parsing as a Cloud Service
Abstract:
Logs are a critical data source for cloud systems, enabling advanced features like monitoring, alerting, and root cause analysis. However, the massive scale and diverse formats of unstructured logs pose challenges for adaptable, efficient, and accurate parsing methods. This paper introduces ByteBrain-LogParser, an innovative log parsing framework designed specifically for cloud environments. ByteBrain-LogParser employs a hierarchical clustering algorithm to allow real-time precision adjustments, coupled with optimizations such as positional similarity distance, deduplication, and hash encoding to enhance performance. Experiments on large-scale datasets show that it processes 229,000 logs per second on average, achieving an 840% speedup over the fastest baseline while maintaining accuracy comparable to state-of-the-art methods. Real-world evaluations further validate its efficiency and adaptability, demonstrating its potential as a robust cloud-based log parsing solution.
中文: 本文提出的ByteBrain-LogParser框架采用分层聚类算法,实现了实时精度调节,每秒可处理22.9万条日志,速度比基线提升840%,同时保持高精度,为云端日志解析提供高效解决方案。
English: This paper presents ByteBrain-LogParser, a hierarchical clustering-based framework that achieves real-time precision adjustment and processes 229,000 logs per second with an 840% speedup over baselines while maintaining high accuracy for cloud log parsing.
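Illustrative sketch (not from the paper): the abstract names hash encoding, deduplication, and a positional similarity distance as optimizations without defining them. The Python below shows one plausible reading of each, under the assumptions that logs are whitespace-tokenized and that the distance is the fraction of token positions that differ. All function names are hypothetical.

```python
import zlib


def encode(log_line):
    """Hash-encode each whitespace-separated token into an integer."""
    return tuple(zlib.crc32(tok.encode("utf-8")) for tok in log_line.split())


def deduplicate(lines):
    """Keep the first occurrence of each distinct encoded log line."""
    seen = {}
    for line in lines:
        seen.setdefault(encode(line), line)
    return list(seen.values())


def positional_distance(a, b):
    """Fraction of positions whose tokens differ; 1.0 if the lengths differ."""
    if len(a) != len(b):
        return 1.0
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    logs = ["user 1 logged in", "user 1 logged in", "user 2 logged in"]
    unique = deduplicate(logs)
    print(unique, positional_distance(encode(unique[0]), encode(unique[1])))
```

A clustering step could then merge lines whose distance falls under a threshold, and changing that threshold would make the resulting templates coarser or finer.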

Authors:Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
Title: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Abstract:
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers:(1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to $\bf{3 \space billion}$ parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
中文摘要:GigaTok通过语义正则化将分词器特征与预训练视觉编码器对齐,解决了视觉分词器扩展中重建与生成的矛盾,在30亿参数规模下实现了重建、生成和表征学习的同步提升。
English Summary: GigaTok introduces semantic regularization to align tokenizer features with pre-trained visual encoders, enabling scalable visual tokenizers that simultaneously enhance image reconstruction, generation, and representation learning while achieving state-of-the-art performance at 3 billion parameters.

Authors:Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang
Title: Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Abstract:
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
中文: 本技术报告提出了Seaweed-7B视频生成基础模型,该7B参数模型仅使用66.5万H100 GPU小时训练,通过关键设计决策实现了与更大模型相媲美的性能,并展现出优秀的泛化能力和下游应用适应性。
English: This report introduces Seaweed-7B, a cost-effective video generation model with 7 billion parameters that achieves competitive performance against larger models through optimized design choices despite using only 665,000 H100 GPU hours for training.

Authors:Fengrui Liu, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Lihua Yi, Haipeng Zhang, Gang Wu, Rui Shi
Title: TickIt: Leveraging Large Language Models for Automated Ticket Escalation
Abstract:
In large-scale cloud service systems, support tickets serve as a critical mechanism for resolving customer issues and maintaining service quality. However, traditional manual ticket escalation processes encounter significant challenges, including inefficiency, inaccuracy, and difficulty in handling the high volume and complexity of tickets. While previous research has proposed various machine learning models for ticket classification, these approaches often overlook the practical demands of real-world escalations, such as dynamic ticket updates, topic-specific routing, and the analysis of ticket relationships. To bridge this gap, this paper introduces TickIt, an innovative online ticket escalation framework powered by Large Language Models. TickIt enables topic-aware, dynamic, and relationship-driven ticket escalations by continuously updating ticket states, assigning tickets to the most appropriate support teams, exploring ticket correlations, and leveraging category-guided supervised fine-tuning to continuously improve its performance. By deploying TickIt in ByteDance's cloud service platform Volcano Engine, we validate its efficacy and practicality, marking a significant advancement in the field of automated ticket escalation for large-scale cloud service systems.
中文: 本文提出TickIt,一种基于大型语言模型的创新在线工单升级框架,通过动态更新工单状态、智能分配支持团队及分析工单关联,实现了主题感知和关系驱动的自动化升级,有效解决了传统人工处理及既有机器学习方法在大规模云服务系统中的不足。
English: This paper introduces TickIt, an innovative online ticket escalation framework using Large Language Models to enable dynamic, topic-aware, and relationship-driven escalations, addressing the limitations of traditional manual processes and previous machine learning approaches in large-scale cloud service systems.

Authors:Jian Wang, Rishabh Dabral, Diogo Luvizon, Zhe Cao, Lingjie Liu, Thabo Beeler, Christian Theobalt
Title: Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
Abstract:
This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
Chinese Summary: 本研究提出Ego4o框架,通过整合可穿戴设备的多模态输入数据,在部分数据可用时仍能保持稳定性能,并利用运动描述提升精度,实现对人体运动的同步捕捉与理解。
English Summary: This study introduces Ego4o, a framework that utilizes multi-modal inputs from wearable devices to simultaneously capture and interpret human motion, maintaining robust performance with partial data and improving accuracy through integrated sensor modalities and motion descriptions.

Authors:Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian
Title: MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Abstract:
World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, we transform visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer, respectively, and construct the model input by interleaving the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models in different scales generate $4$ to $7$ frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion-based world models significantly. The code and model have been released.
中文摘要:MineWorld提出了一种基于视觉-动作自回归Transformer的实时交互世界模型,通过并行解码算法和新的评估指标,在Minecraft中生成高质量游戏场景,显著超越了现有最优的扩散模型。
English Summary: MineWorld introduces a real-time interactive world model for Minecraft using a visual-action autoregressive Transformer that generates game scenes by predicting tokens from visual and action inputs, achieving superior performance over state-of-the-art models through parallel decoding and novel evaluation metrics.
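Illustrative sketch (not from the paper): the abstract says scene tokens and action tokens are interleaved into one input sequence for next-token prediction. The Python below shows that layout on toy integer ids. The function name and the id values are hypothetical, and the real tokenizers and vocabulary layout are not described in the abstract.

```python
def interleave_tokens(frame_tokens, action_tokens):
    """Build one sequence: frame_0 ids, action_0 ids, frame_1 ids, action_1 ids, ...

    frame_tokens and action_tokens are equally long lists of id lists,
    one entry per time step.
    """
    if len(frame_tokens) != len(action_tokens):
        raise ValueError("need one action token group per frame")
    sequence = []
    for frame_ids, action_ids in zip(frame_tokens, action_tokens):
        sequence.extend(frame_ids)
        sequence.extend(action_ids)
    return sequence


if __name__ == "__main__":
    frames = [[1, 2], [3, 4]]   # toy image-token ids per frame
    actions = [[100], [101]]    # toy action-token ids per step
    print(interleave_tokens(frames, actions))
```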

Authors:Alireza Salemi, Chris Samarinas, Hamed Zamani
Title: Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation
Abstract:
This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal for improving the proposal quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments based on the ICAT evaluation methodology--a recent approach for answer factuality and comprehensiveness evaluation. Experiments on the two diverse information seeking benchmarks adopted from non-factoid question answering and TREC search result diversification tasks demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller scale user study confirms the substantial efficacy of the P&R framework.
中文: 本文提出的规划与优化框架通过两阶段设计提升大语言模型响应的多样性和全面性,在信息检索基准测试中显著优于现有方法。
English: This paper introduces the Plan-and-Refine framework to enhance response diversity and comprehensiveness in large language models, demonstrating significant performance improvements over baselines on information-seeking benchmarks.
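Illustrative sketch (not from the paper): the two phases and the final selection step described in the abstract map onto a small control loop. Every callable below (`generate_plans`, `respond`, `refine`, `reward`) is a hypothetical placeholder for an LLM or reward-model call, and the fixed number of refinement rounds is an assumption.

```python
def plan_and_refine(query, generate_plans, respond, refine, reward, refine_rounds=2):
    """Global exploration over plans, local refinement per plan, reward-based selection."""
    proposals = []
    for plan in generate_plans(query):           # global exploration phase
        proposal = respond(query, plan)          # initial proposal conditioned on the plan
        for _ in range(refine_rounds):           # local exploitation phase
            proposal = refine(query, plan, proposal)
        proposals.append(proposal)
    return max(proposals, key=reward)            # reward model picks the final answer


if __name__ == "__main__":
    # Toy stand-ins: plans are aspect lists, refinement appends "!", reward is length.
    best = plan_and_refine(
        "q",
        generate_plans=lambda q: [["a"], ["a", "b"]],
        respond=lambda q, plan: " ".join(plan),
        refine=lambda q, plan, p: p + "!",
        reward=len,
    )
    print(best)
```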

Authors:Kaidi Wang, Wenhao Guan, Shenghui Lu, Jianglong Yao, Lin Li, Qingyang Hong
Title: SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow
Abstract:
Recently, flow matching based speech synthesis has significantly enhanced the quality of synthesized speech while reducing the number of inference steps. In this paper, we introduce SlimSpeech, a lightweight and efficient speech synthesis system based on rectified flow. We have built upon the existing speech synthesis method utilizing the rectified flow model, modifying its structure to reduce parameters and serve as a teacher model. By refining the reflow operation, we directly derive a smaller model with a straighter sampling trajectory from the larger model, while utilizing distillation techniques to further enhance the model performance. Experimental results demonstrate that our proposed method, with significantly reduced model parameters, achieves comparable performance to larger models through one-step sampling.
中文:SlimSpeech是一种基于整流流的轻量级语音合成系统,通过减少模型参数并利用一步采样,实现了与大型模型相媲美的性能。
English: SlimSpeech, a lightweight speech synthesis system based on rectified flow, reduces model parameters and achieves comparable performance to larger models through efficient one-step sampling.
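Illustrative sketch (not from the paper): flow-based samplers integrate a learned velocity field from noise to data, and a straight trajectory is what makes a single integration step sufficient. The Python below uses a plain Euler integrator with a toy constant velocity field (an assumption standing in for the trained model) to show that one step and ten steps then reach the same point.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy field: constant velocity, so every trajectory is a straight line.
    straight = lambda x, t: [2.0, -1.0]
    one_step = euler_sample(straight, [0.0, 0.0], num_steps=1)
    ten_steps = euler_sample(straight, [0.0, 0.0], num_steps=10)
    print(one_step, all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(one_step, ten_steps)))
```

For a curved field the two results would differ, so a one-step sample would only approximate the endpoint that many small steps reach.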

Authors:Shouren Wang, Zehua Jiang, Fernando Sliva, Sam Earle, Julian Togelius
Title: Enhancing Player Enjoyment with a Two-Tier DRL and LLM-Based Agent System for Fighting Games
Abstract:
Deep reinforcement learning (DRL) has effectively enhanced gameplay experiences and game design across various game genres. However, few studies on fighting game agents have focused explicitly on enhancing player enjoyment, a critical factor for both developers and players. To address this gap and establish a practical baseline for designing enjoyability-focused agents, we propose a two-tier agent (TTA) system and conduct experiments in the classic fighting game Street Fighter II. The first tier of TTA employs a task-oriented network architecture, modularized reward functions, and hybrid training to produce diverse and skilled DRL agents. In the second tier of TTA, a Large Language Model Hyper-Agent, leveraging players' playing data and feedback, dynamically selects suitable DRL opponents. In addition, we investigate and model several key factors that affect the enjoyability of the opponent. The experiments demonstrate improvements from 64.36% to 156.36% in the execution of advanced skills over baseline methods. The trained agents also exhibit distinct game-playing styles. Additionally, we conducted a small-scale user study, and the overall enjoyment in the player's feedback validates the effectiveness of our TTA system.
中文摘要:本研究针对格斗游戏提出了一种双层智能体系统,通过结合深度强化学习与大语言模型动态选择对手,实验表明该系统不仅显著提升了技能执行水平,还通过用户反馈验证了其有效增强玩家游戏乐趣的效果。
English Summary: This study introduces a two-tier agent system for fighting games that combines deep reinforcement learning with a Large Language Model to dynamically select opponents, significantly improving skill execution and player enjoyment as validated by user feedback.

Authors:Hanxiao Sun, YuPeng Gao, Jin Xie, Jian Yang, Beibei Wang
Title: SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering
Abstract:
Reconstructing 3D assets from images, known as inverse rendering (IR), remains a challenging task due to its ill-posed nature. 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities for novel view synthesis (NVS) tasks. Methods apply it to relighting by separating radiance into BRDF parameters and lighting, yet produce inferior relighting quality with artifacts and unnatural indirect illumination due to the limited capability of each Gaussian, which has constant material parameters and normal, alongside the absence of physical constraints for indirect lighting. In this paper, we present a novel framework called Spatially-varying Gaussian Inverse Rendering (SVG-IR), aimed at enhancing both NVS and relighting quality. To this end, we propose a new representation, the Spatially-varying Gaussian (SVG), that allows per-Gaussian spatially varying parameters. This enhanced representation is complemented by an SVG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines. Furthermore, we integrate a physically-based indirect lighting model, enabling more realistic relighting. The proposed SVG-IR framework significantly improves rendering quality, outperforming state-of-the-art NeRF-based methods by 2.5 dB in peak signal-to-noise ratio (PSNR) and surpassing existing Gaussian-based techniques by 3.5 dB in relighting tasks, all while maintaining a real-time rendering speed.
Chinese: 提出的SVG-IR框架引入了空间变化高斯参数和物理间接光照模型,显著提升了新视角合成和重光照质量,在保持实时渲染的同时,其PSNR指标比现有方法高出2.5-3.5分贝。
English: The proposed SVG-IR framework introduces spatially varying Gaussian parameters and a physical indirect lighting model to significantly enhance novel view synthesis and relighting quality, outperforming existing methods by 2.5-3.5 dB in PSNR while maintaining real-time rendering.

Authors:Christoph Balada, Aida Romano-Martinez, Vincent ten Cate, Katharina Geschke, Jonas Tesarz, Paul Claßen, Alexander K. Schuster, Dativa Tibyampansha, Karl-Patrik Kresoja, Philipp S. Wild, Sheraz Ahmed, Andreas Dengel
Title: Deep Learning for Cardiovascular Risk Assessment: Proxy Features from Carotid Sonography as Predictors of Arterial Damage
Abstract:
In this study, hypertension is utilized as an indicator of individual vascular damage. This damage can be identified through machine learning techniques, providing an early risk marker for potential major cardiovascular events and offering valuable insights into the overall arterial condition of individual patients. To this end, the VideoMAE deep learning model, originally developed for video classification, was adapted by finetuning for application in the domain of ultrasound imaging. The model was trained and tested using a dataset comprising over 31,000 carotid sonography videos sourced from the Gutenberg Health Study (15,010 participants), one of the largest prospective population health studies. This adaptation facilitates the classification of individuals as hypertensive or non-hypertensive (75.7% validation accuracy), functioning as a proxy for detecting visual arterial damage. We demonstrate that our machine learning model effectively captures visual features that provide valuable insights into an individual's overall cardiovascular health.
中文: 本研究通过微调VideoMAE深度学习模型,利用颈动脉超声视频对高血压进行分类(验证准确率75.7%),以此作为检测动脉损伤和评估心血管健康的有效指标。
English: This study adapts the VideoMAE deep learning model to classify hypertension from carotid ultrasound videos, achieving 75.7% accuracy as a proxy for detecting arterial damage and assessing cardiovascular health.

Authors:Shaocong Long, Qianyu Zhou, Xikun Jiang, Chenhao Ying, Lizhuang Ma, Yuan Luo
Title: Domain Generalization via Discrete Codebook Learning
Abstract:
Domain generalization (DG) strives to address distribution shifts across diverse environments to enhance a model's generalizability. Current DG approaches are confined to acquiring robust representations with continuous features, specifically training at the pixel level. However, this DG paradigm may struggle to mitigate distribution gaps when dealing with a large space of continuous features, rendering it susceptible to pixel details that exhibit spurious correlations or noise. In this paper, we first theoretically demonstrate that the domain gaps in continuous representation learning can be reduced by the discretization process. Based on this finding, we introduce a novel learning paradigm for DG, termed Discrete Domain Generalization (DDG). DDG uses a codebook to quantize the feature map into discrete codewords, aligning semantically equivalent information in a shared discrete representation space that prioritizes semantic-level information over pixel-level intricacies. By learning at the semantic level, DDG diminishes the number of latent features, optimizing the utilization of the representation space and alleviating the risks associated with the wide-ranging space of continuous features. Extensive experiments across widely employed benchmarks in DG demonstrate DDG's superior performance compared to state-of-the-art approaches, underscoring its potential to reduce the distribution gaps and enhance the model's generalizability.
中文摘要:领域泛化在处理连续特征时面临分布差异问题,而本文提出的离散领域泛化方法通过特征离散化过程,在共享离散表示空间中优先关注语义级信息,有效缩小分布差距并提升模型泛化能力。
English Summary: Domain generalization (DG) faces challenges with continuous features, but the proposed Discrete Domain Generalization (DDG) method uses feature discretization to reduce distribution gaps and improve model generalizability by focusing on semantic-level information.
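
The codebook quantization step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `quantize`, the squared Euclidean distance, and the toy codebook and feature values are all assumptions chosen for illustration. The point is only that nearby continuous features collapse onto the same discrete codeword.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codeword (squared Euclidean distance).

    Hypothetical helper for illustration; the abstract does not specify the
    distance measure or how the codebook is learned.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy 2-D codebook with three codewords (assumed values).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# The first two features differ only by small perturbations, so they
# should share a codeword; the third lies near a different codeword.
features = [(0.9, 1.1), (1.2, 0.8), (3.7, 0.2)]

indices, quantized = quantize(features, codebook)
print(indices)
```

After quantization, the first two features are represented by the same codeword, which is the sense in which small continuous variations are discarded in this sketch.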

Authors:Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, Rui Zhao
Title: On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
Abstract:
Reinforcement Fine-Tuning (RFT) has proved to be greatly valuable for enhancing the reasoning ability of LLMs. Researchers have started to apply RFT to MLLMs, hoping it will also enhance their visual understanding capabilities. However, these works are at a very early stage and have not examined how suitable RFT actually is for visual tasks. In this work, we endeavor to understand the suitability and limitations of RFT for visual tasks through experimental analysis and observations. We start with quantitative comparisons on various tasks, which show that RFT is generally better than SFT on visual tasks. To check whether such advantages are brought about by the reasoning process, we design a new reward that encourages the model to "think" more; the results show that more thinking can be beneficial for complicated tasks but harmful for simple tasks. We hope this study can provide more insight for the rapid advancements on this topic.
Chinese: 强化微调(RFT)在视觉任务上优于监督微调,但其效果与任务复杂度相关——增强推理过程对复杂任务有益,却可能不利于简单任务。
English: Reinforcement Fine-Tuning (RFT) generally outperforms Supervised Fine-Tuning on visual tasks for MLLMs, with benefits tied to task complexity: encouraging more reasoning helps complex tasks but hinders simple ones.

Authors:Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan
Title: VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Abstract:
We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
中文: VAPO是一种基于价值的强化学习框架,在AIME 2024数据集上以60.4分创下最佳性能,通过系统设计有效解决了长链推理中的三大挑战,展现出卓越的训练稳定性和效率。
English: VAPO is a novel value-based reinforcement learning framework that achieves state-of-the-art performance on the AIME 2024 dataset with a score of 60.4, demonstrating superior stability and efficiency by addressing key challenges in long chain-of-thought reasoning.

Authors:Shuai Chen, Fanman Meng, Haoran Wei, Chenhao Wu, Qingbo Wu, Linfeng Xu, Hongliang Li
Title: CMaP-SAM: Contraction Mapping Prior for SAM-driven Few-shot Segmentation
Abstract:
Few-shot segmentation (FSS) aims to segment new classes using few annotated images. While recent FSS methods have shown considerable improvements by leveraging Segment Anything Model (SAM), they face two critical limitations: insufficient utilization of structural correlations in query images, and significant information loss when converting continuous position priors to discrete point prompts. To address these challenges, we propose CMaP-SAM, a novel framework that introduces contraction mapping theory to optimize position priors for SAM-driven few-shot segmentation. CMaP-SAM consists of three key components: (1) a contraction mapping module that formulates position prior optimization as a Banach contraction mapping with convergence guarantees. This module iteratively refines position priors through pixel-wise structural similarity, generating a converged prior that preserves both semantic guidance from reference images and structural correlations in query images; (2) an adaptive distribution alignment module bridging continuous priors with SAM's binary mask prompt encoder; and (3) a foreground-background decoupled refinement architecture producing accurate final segmentation masks. Extensive experiments demonstrate CMaP-SAM's effectiveness, achieving state-of-the-art performance with 71.1 mIoU on PASCAL-$5^i$ and 56.1 on COCO-$20^i$ datasets.
中文:CMaP-SAM是一种创新框架,通过引入压缩映射理论优化SAM的位置先验,解决了少样本分割中的结构关联利用不足和信息丢失问题,在基准数据集上取得了最优性能。
English: CMaP-SAM is a novel framework that addresses limitations in few-shot segmentation by applying contraction mapping theory to optimize position priors for SAM, achieving state-of-the-art results on benchmark datasets.
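
The convergence guarantee mentioned in this abstract rests on the Banach fixed-point theorem: iterating a contraction converges to a unique fixed point from any starting point. The sketch below illustrates that idea with a generic update, and it is not the paper's module. The update rule `p <- (1 - alpha) * p0 + alpha * S p`, the row-stochastic similarity matrix `S`, the mixing weight `alpha`, and all numeric values are assumptions. With rows of `S` summing to 1 and `0 < alpha < 1`, this map is a contraction in the max norm with factor `alpha`.

```python
def refine_prior(p0, S, alpha=0.5, init=None, tol=1e-10, max_iter=1000):
    """Fixed-point iteration of T(p) = (1 - alpha) * p0 + alpha * S @ p.

    p0 : initial position prior (one value per pixel), kept as an anchor term.
    S  : row-stochastic similarity matrix (assumed; rows are non-negative, sum to 1).
    Returns the converged prior and the number of iterations used.
    """
    n = len(p0)
    p = list(p0) if init is None else list(init)
    for it in range(1, max_iter + 1):
        new_p = [
            (1 - alpha) * p0[i] + alpha * sum(S[i][j] * p[j] for j in range(n))
            for i in range(n)
        ]
        diff = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if diff < tol:
            return p, it
    return p, max_iter


# Toy example: three "pixels" in a chain, each similar to its neighbours.
S = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
p0 = [1.0, 0.0, 0.0]  # prior mass initially on the first pixel only

prior, steps = refine_prior(p0, S)
# Starting from a different point reaches the same fixed point.
prior_alt, _ = refine_prior(p0, S, init=[0.3, 0.9, 0.1])
print(steps, [round(v, 4) for v in prior])
```

The anchor term keeps the result tied to the original prior, while the similarity term spreads it to structurally similar locations. Both properties are only a loose analogy to the behaviour the abstract describes.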

Authors:Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
Title: A Taxonomy of Self-Handover
Abstract:
Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robotics.
中文: 本研究首次提出自我传递的系统分类法,揭示其作为需要双手预期性调整的高度协调动作,并通过视觉语言模型展示了自动化分类的可行性,为发展自适应双臂机器人技术提供了新见解。
English: This study introduces the first systematic taxonomy of self-handover, revealing it as a highly coordinated bimanual action with anticipatory adjustments, and demonstrates its automated classification potential using vision-language models for advancing dual-arm robotics.

Authors:Md Bayazid Hossain, Md Anwarul Islam Himel, Md Abdur Rahim, Shabbir Mahmood, Abu Saleh Musa Miah, Jungpil Shin
Title: Classification of ADHD and Healthy Children Using EEG Based Multi-Band Spatial Features Enhancement
Abstract:
Attention Deficit Hyperactivity Disorder (ADHD) is a common neurodevelopmental disorder in children, characterized by difficulties in attention, hyperactivity, and impulsivity. Early and accurate diagnosis of ADHD is critical for effective intervention and management. Electroencephalogram (EEG) signals have emerged as a non-invasive and efficient tool for ADHD detection due to their high temporal resolution and ability to capture neural dynamics. In this study, we propose a method for classifying ADHD and healthy children using EEG data from a benchmark dataset. The dataset comprises 61 children with ADHD and 60 healthy children, both boys and girls, aged 7 to 12. The EEG signals, recorded from 19 channels, were processed to extract Power Spectral Density (PSD) and Spectral Entropy (SE) features across five frequency bands, resulting in a comprehensive 190-dimensional feature set. In the evaluation, a Support Vector Machine (SVM) with the RBF kernel demonstrated the best performance, with a mean cross-validation accuracy of 99.2% and a standard deviation of 0.0079, indicating high robustness and precision. These results highlight the potential of spatial features in conjunction with machine learning for accurately classifying ADHD using EEG data. This work contributes to developing non-invasive, data-driven tools for early diagnosis and assessment of ADHD in children.
中文: 本研究利用脑电图数据,通过提取功率谱密度和谱熵特征对儿童多动症进行分类,采用支持向量机分类器达到99.2%的准确率,证明了机器学习在非侵入性多动症诊断中的有效性。
English: This study uses EEG data to classify ADHD in children by extracting Power Spectral Density and Spectral Entropy features, achieving 99.2% accuracy with an SVM classifier, demonstrating the effectiveness of machine learning for non-invasive ADHD diagnosis.
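
The 190-dimensional feature set is consistent with 19 channels × 5 bands × 2 features (one PSD-based value and one spectral entropy value per band). The sketch below shows one way such a vector could be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the band edges, the naive periodogram, the use of summed PSD as the band power, and the synthetic 10 Hz test signal are all hypothetical choices.

```python
import cmath
import math

# Band edges in Hz are an assumption; the abstract only says "five frequency bands".
BANDS = [(0.5, 4.0), (4.0, 8.0), (8.0, 13.0), (13.0, 30.0), (30.0, 45.0)]


def periodogram(x, fs):
    """Naive periodogram via a direct DFT (O(n^2)); no one-sided scaling applied."""
    n = len(x)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        psd.append(abs(s) ** 2 / (fs * n))
    return freqs, psd


def band_features(freqs, psd, lo, hi):
    """Return (summed PSD, spectral entropy) over bins with lo <= f < hi."""
    vals = [p for f, p in zip(freqs, psd) if lo <= f < hi]
    total = sum(vals)
    if total <= 0.0:
        return 0.0, 0.0
    probs = [v / total for v in vals if v > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return total, entropy


def extract_features(channels, fs):
    """Concatenate [power, entropy] for every band of every channel."""
    features = []
    for x in channels:
        freqs, psd = periodogram(x, fs)
        for lo, hi in BANDS:
            features.extend(band_features(freqs, psd, lo, hi))
    return features


# Synthetic stand-in for a recording: 19 channels, 1 s of a 10 Hz sine at 128 Hz.
fs, n = 128, 128
sine = [math.sin(2 * math.pi * 10 * t / fs) for t in range(n)]
channels = [sine] * 19

features = extract_features(channels, fs)
print(len(features))
```

In practice a Welch estimate (for example from a signal-processing library) would replace the naive periodogram, and the resulting vectors would be fed to the classifier.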

Authors:Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
Title: OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Abstract:
The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
中文:OmniDrive数据集通过反事实推理将视觉语言模型与三维理解结合,提升了自动驾驶的决策能力,并在基准测试中表现出显著改进。
English: The OmniDrive dataset enhances autonomous driving by integrating 3D understanding with vision-language models through counterfactual reasoning, improving decision-making and performance on benchmarks.

Authors:Jungpil Shin, Abu Saleh Musa Miah, Sota Konnai, Shu Hoshitaka, Pankoo Kim
Title: Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics
Abstract:
Hand gesture recognition using multichannel surface electromyography (sEMG) is challenging due to unstable predictions and inefficient time-varying feature enhancement. To address the lack of effective signal-based time-varying features, we propose a lightweight, squeeze-and-excitation-based, multi-stream deep learning approach that extracts spatial-temporal dynamics and time-varying features to build an effective sEMG-based hand gesture recognition system. Each branch of the proposed model was designed to extract hierarchical features, capturing both global and detailed spatial-temporal relationships to ensure feature effectiveness. The first branch, utilizing a Bidirectional-TCN (Bi-TCN), focuses on capturing long-term temporal dependencies by modelling past and future temporal contexts, providing a holistic view of gesture dynamics. The second branch, incorporating a 1D Convolutional layer, separable CNN, and Squeeze-and-Excitation (SE) block, efficiently extracts spatial-temporal features while emphasizing critical feature channels, enhancing feature relevance. The third branch, combining a Temporal Convolutional Network (TCN) and Bidirectional LSTM (BiLSTM), captures bidirectional temporal relationships and time-varying patterns. Outputs from all branches are fused using concatenation to capture subtle variations in the data and then refined with a channel attention module, selectively focusing on the most informative features while improving computational efficiency. The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively. These results demonstrate the capability of the system to handle complex sEMG dynamics, offering advancements in prosthetic limb control and human-machine interface technologies with significant implications for assistive technologies.
中文: 本研究提出了一种轻量级多流深度学习模型,能有效提取表面肌电信号的时空动态特征,在多个数据集上实现了高精度手势识别,为假肢控制技术提供了重要进展。
English: This study introduces a lightweight multi-stream deep learning model that effectively captures spatial-temporal dynamics from sEMG signals, achieving high gesture recognition accuracy across multiple datasets and advancing prosthetic control technologies.

Authors:Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu
Title: MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
Abstract:
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
中文:MegaScale-Infer是一种创新系统,通过解耦注意力与前馈网络模块、采用专门并行策略及高性能通信库,有效服务大规模专家混合模型,显著提升GPU吞吐量并降低运营成本。
English: MegaScale-Infer is an innovative system designed to efficiently serve large-scale Mixture-of-Experts models by disaggregating attention and feed-forward network modules, employing specialized parallelism strategies and a high-performance communication library to significantly boost GPU throughput and reduce operational costs.

Authors:Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose M. Alvarez
Title: MDP: Multidimensional Vision Model Pruning with Latency Constraint
Abstract:
Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
中文: 本文提出多维剪枝(MDP)新范式,通过联合优化多种剪枝粒度和采用先进延迟建模技术,克服了现有方法在剪枝粒度单一和延迟评估不准的局限,在CNN和Transformer上均显著优于先前方法,实现了更优的精度-延迟平衡。
English: This paper introduces Multi-Dimensional Pruning (MDP), a novel paradigm that overcomes limitations of existing methods by jointly optimizing across multiple pruning granularities and employing advanced latency modeling to achieve superior accuracy-latency trade-offs, significantly outperforming prior approaches on both CNNs and transformers.

Authors:Jincheng Mei, Bo Dai, Alekh Agarwal, Mohammad Ghavamzadeh, Csaba Szepesvari, Dale Schuurmans
Title: Ordering-based Conditions for Global Convergence of Policy Gradient Methods
Abstract:
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. First, we establish a few key observations that frame the study: (i) Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). (ii) Approximation error is not a key quantity for characterizing global convergence in either algorithm. (iii) The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. Second, motivated by these observations, we establish new general results: (i) NPG with linear function approximation achieves global convergence if and only if the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. (ii) The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
中文: 研究表明,在线性函数逼近中,策略梯度方法的全局收敛性取决于表示特性而非逼近误差,其中Softmax PG和自然策略梯度算法需要满足不同的条件。
English: The study demonstrates that global convergence of policy gradient methods in linear function approximation depends on representation properties rather than approximation error, with distinct conditions required for Softmax PG and natural policy gradient algorithms.
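
The NPG condition in this abstract can be made concrete with a small numerical check. The sketch below rests on two assumptions: that the "representable space" is the span of the feature vectors, and that "preserving the optimal action's rank" means the optimal action stays top-ranked after projection. For brevity it uses a single scalar feature per action. The function names and the reward and feature values are hypothetical.

```python
def project_onto_feature(x, r):
    """Least-squares projection of the reward vector r onto span{x} (1-D features)."""
    coef = sum(a * b for a, b in zip(x, r)) / sum(a * a for a in x)
    return [coef * a for a in x]


def preserves_optimal_action(x, r):
    """True if the best action under r is also the best under the projected reward."""
    r_hat = project_onto_feature(x, r)
    best = max(range(len(r)), key=lambda a: r[a])
    best_hat = max(range(len(r_hat)), key=lambda a: r_hat[a])
    return best == best_hat


# Three-armed bandit; action 0 is optimal (assumed toy rewards).
r = [1.0, 0.9, 0.2]

# Neither feature vector can represent r exactly, so both have non-zero
# approximation error, yet only the first keeps action 0 on top.
x_good = [1.0, 0.5, -1.0]
x_bad = [0.5, 1.0, -1.0]

print(preserves_optimal_action(x_good, r), preserves_optimal_action(x_bad, r))
```

In this toy case the two representations have approximation errors of the same order but differ in whether the projected reward still ranks the optimal action first, which is the distinction the abstract draws between ordering and approximation error.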

Authors:Bin Han, Fabienne Renckens, C. Clark Cao, Hans D. Schotten
Title: A Novel Dynamic Epidemic Model for Successive Opinion Diffusion in Social Networks
Abstract:
This paper proposes a dynamic epidemic model for successive opinion diffusion in social networks, extending the SHIMR model. It incorporates dynamic decision-making influenced by social distances and captures accumulative opinion diffusion caused by interrelated rumors. The model reflects the impact of rumor spread on social network structures. Simulations validate its effectiveness in explaining phenomena like the echo chamber effect and provide insights into opinion diffusion dynamics, with implications for understanding social polarization and network evolution.
Chinese: 本文提出了一种社交网络中连续观点传播的动态流行病模型,扩展了SHIMR模型,通过引入社会距离和相互关联的谣言来分析它们对网络结构及回声室等现象的影响。
English: This paper introduces a dynamic epidemic model for successive opinion diffusion in social networks, extending the SHIMR model by incorporating social distances and interrelated rumors to analyze their impact on network structures and phenomena like echo chambers.

Authors:Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi
Title: Plan-and-Act using Large Language Models for Interactive Agreement
Abstract:
Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when to use the LLM to generate an action plan. In this paper, we propose a necessary plan-and-act skill design to solve the above problems. We show that a critical factor for enabling a robot to switch between passive / active interaction behavior is to provide the LLM with an action text about the current robot's action. We also show that a second-stage question to the LLM (about the next timing to call the LLM) is necessary for planning actions at an appropriate timing. The skill design is applied to an Engage skill and is tested on four distinct interaction scenarios. We show that by using the skill design, LLMs can be leveraged to easily scale to different HRI scenarios with a reasonable success rate reaching 90% on the test scenarios.
中文摘要:本文提出一种技能设计,使大型语言模型能够在人机交互中平衡尊重人类活动与机器人任务优先级,并确定行动规划的最佳时机,在测试场景中成功率高达90%。
English Summary: Recent large language models can plan robot actions, and this paper proposes a skill design that enables them to balance respecting human activities with task priorities and determine optimal timing for action planning in human-robot interaction scenarios, achieving a 90% success rate.

Authors:Shunxin Chen, Ajian Liu, Junze Zheng, Jun Wan, Kailai Peng, Sergio Escalera, Zhen Lei
Title: Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection
Abstract:
Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
Chinese: 提出的FG-MoE-CLIP-CAR框架通过专家特征处理和类感知正则化,有效区分真实与伪造人脸,在物理-数字混合攻击数据集上实现了最先进的性能。
English: The proposed FG-MoE-CLIP-CAR framework enhances facial recognition security by employing specialized feature processing and class-aware regularization to effectively distinguish between live and fake faces, achieving state-of-the-art results on combined physical-digital attack datasets.

Authors:Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Zhiyao Liang, Yanyan Liang, Jun Wan, Zhen Lei
Title: FA³-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection
Abstract:
Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to facilitate the evaluation of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
中文: 提出的FA³-CLIP模型通过攻击无关提示学习和双流特征融合,结合空间与频域信息,实现了对物理和数字人脸攻击的统一检测。
English: The proposed FA³-CLIP model introduces attack-agnostic prompt learning and dual-stream feature fusion to enable unified detection of both physical and digital face attacks by effectively combining spatial and frequency information.

Authors:Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, Koushil Sreenath
Title: LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning
Abstract:
General-purpose humanoid robots are expected to interact intuitively with humans, enabling seamless integration into daily life. Natural language provides the most accessible medium for this purpose. However, translating language into humanoid whole-body motion remains a significant challenge, primarily due to the gap between linguistic understanding and physical actions. In this work, we present an end-to-end, language-directed policy for real-world humanoid whole-body control. Our approach combines reinforcement learning with policy distillation, allowing a single neural network to interpret language commands and execute corresponding physical actions directly. To enhance motion diversity and compositionality, we incorporate a Conditional Variational Autoencoder (CVAE) structure. The resulting policy achieves agile and versatile whole-body behaviors conditioned on language inputs, with smooth transitions between various motions, enabling adaptation to linguistic variations and the emergence of novel motions. We validate the efficacy and generalizability of our method through extensive simulations and real-world experiments, demonstrating robust whole-body control. Please see our website at LangWBC.github.io for more information.
中文: 本研究提出了一种端到端的语言导向策略,结合强化学习、策略蒸馏和条件变分自编码器结构,使人形机器人能够根据语言指令直接执行多样灵活的全身动作,并通过仿真和实际实验验证了其有效性。
English: This study introduces an end-to-end language-directed policy that integrates reinforcement learning with policy distillation and a CVAE structure, enabling humanoid robots to execute diverse, agile whole-body motions directly from language commands, validated through simulations and real-world experiments.

Authors:Hassan Sartaj, Jalil Boudjadar, Mirgita Frasheri, Shaukat Ali, Peter Gorm Larsen
Title: Identifying Uncertainty in Self-Adaptive Robotics with Large Language Models
Abstract:
Future self-adaptive robots are expected to operate in highly dynamic environments while effectively managing uncertainties. However, identifying the sources and impacts of uncertainties in such robotic systems and defining appropriate mitigation strategies is challenging due to the inherent complexity of self-adaptive robots and the lack of comprehensive knowledge about the various factors influencing uncertainty. Hence, practitioners often rely on intuition and past experiences from similar systems to address uncertainties. In this article, we evaluate the potential of large language models (LLMs) in enabling a systematic and automated approach to identify uncertainties in self-adaptive robotics throughout the software engineering lifecycle. For this evaluation, we analyzed 10 advanced LLMs with varying capabilities across four industrial-sized robotics case studies, gathering the practitioners' perspectives on the LLM-generated responses related to uncertainties. Results showed that practitioners agreed with 63-88% of the LLM responses and expressed strong interest in the practicality of LLMs for this purpose.
中文摘要:未来自适应机器人在动态环境中管理不确定性面临挑战,但研究表明大型语言模型能系统识别这些不确定性,从业者对63-88%的AI生成回应表示认可,并对其实际应用表现出浓厚兴趣。
English Summary: Future self-adaptive robots face challenges in managing uncertainties, but this study demonstrates that large language models can systematically identify these uncertainties, with practitioners agreeing with 63-88% of the AI-generated responses and showing strong interest in their practical application.

Authors:Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen
Title: Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection
Abstract:
Monocular 3D lane detection aims to estimate 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations, including inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim at enhancing the ability to perceive the geometry of scenes by leveraging temporal geometric consistency. Second, we strive to improve the integrity of lanes by revealing more instance information from temporal sequences. Therefore, we propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection. On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames, facilitating effective geometry perception. On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation, thereby enabling the exploration of comprehensive instance information. Experiments demonstrate that our GTA-Net achieves SoTA results, surpassing existing monocular 3D lane detection solutions.
中文: 本文提出GTA-Net这一新型几何感知时序聚合网络,通过利用多帧间的时序几何一致性和实例信息,有效提升了单目3D车道检测的精度与完整性。
English: This paper introduces GTA-Net, a novel geometry-aware temporal aggregation network that enhances 3D lane detection accuracy and integrity by leveraging temporal geometric consistency and instance information across multiple frames.

Authors:Maria Khelli, Samuel Cahyawijaya, Ayu Purwarianti, Genta Indra Winata
Title: What Causes Knowledge Loss in Multilingual Language Models?
Abstract:
Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explores this issue in multilingual contexts, focusing on linguistic differences affecting representational learning rather than just model parameters. We experiment with 52 languages using LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Our aim is to see if parameter sharing through adapters can mitigate forgetting while preserving prior knowledge. We find that languages using non-Latin scripts are more susceptible to catastrophic forgetting, whereas those written in Latin script facilitate more effective cross-lingual transfer.
中文: 跨语言迁移在自然语言处理中提升多语言能力,但面临灾难性遗忘等挑战,尤其是非拉丁文字语言更易受影响;本研究通过测试参数共享适配器,探索在52种语言中如何保留先前知识。
English: Cross-lingual transfer in NLP improves multilingual capabilities but faces challenges like catastrophic forgetting, especially for non-Latin script languages, which our study addresses by testing parameter-sharing adapters to preserve knowledge across 52 languages.

Authors:Marina Mayor-Rocher, Cristina Pozo, Nina Melero, Gonzalo Martínez, María Grandury, Pedro Reviriego
Title: It's the same but not the same: Do LLMs distinguish Spanish varieties?
Abstract:
In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but rather one rich in diatopic variations spanning both sides of the Atlantic. For this reason, in this study, we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular Spanish variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language. -- En los últimos años, los grandes modelos de lenguaje (LLMs, por sus siglas en inglés) han demostrado una alta capacidad para comprender y generar texto en español. Sin embargo, con quinientos millones de hablantes nativos, la española no es una lengua homogénea, sino rica en variedades diatópicas que se extienden a ambos lados del Atlántico. Por todo ello, evaluamos en este trabajo la capacidad de nueve modelos de lenguaje de identificar y discernir las peculiaridades morfosintácticas y léxicas de siete variedades de español (andino, antillano, caribeño continental, chileno, español peninsular, mexicano y centroamericano y rioplatense) mediante un test de respuesta múltiple. Los resultados obtenidos indican que la variedad de español peninsular es la mejor identificada por todos los modelos y que, de entre todos, GPT-4o es el único modelo capaz de identificar la variabilidad de la lengua española.
中文:本研究评估了九种语言模型识别七种西班牙语方言变体的能力,发现半岛西班牙语最易被识别,且GPT-4o是唯一能识别西班牙语多样性的模型。
English: This study evaluates nine language models' ability to identify seven Spanish dialect variations, finding Peninsular Spanish most recognizable and GPT-4o as the only model capable of recognizing Spanish linguistic diversity.

Authors:Erblin Isaku, Hassan Sartaj, Shaukat Ali
Title: Digital Twin-based Out-of-Distribution Detection in Autonomous Vessels
Abstract:
An autonomous vessel (AV) is a complex cyber-physical system (CPS) with software enabling many key functionalities, e.g., navigation software enables an AV to autonomously or semi-autonomously follow a path to its destination. Digital twins of such AVs enable advanced functionalities such as running what-if scenarios, performing predictive maintenance, and enabling fault diagnosis. Due to technological improvements, real-time analyses using continuous data from vessels' real-time operations have become increasingly possible. However, the literature has rarely explored advanced real-time data analyses for AVs using digital twins built with machine learning techniques. To this end, we present a novel digital twin-based approach (ODDIT) to detect future out-of-distribution (OOD) states of an AV before reaching them, enabling proactive intervention. Such states may indicate anomalies requiring attention (e.g., manual correction by the ship master) and assist testers in scenario-centered testing. The digital twin consists of two machine-learning models predicting future vessel states and whether the predicted state will be OOD. We evaluated ODDIT with five vessels across waypoint and zigzag maneuvering under simulated conditions, including sensor and actuator noise and environmental disturbances (i.e., ocean current). ODDIT achieved high accuracy in detecting OOD states, with AUROC and TNR@TPR95 scores reaching 99% across multiple vessels.
中文: ODDIT方法采用基于机器学习的数字孪生技术,可在自主船舶到达前主动预测其未来异常状态,并在模拟环境中实现了高精度检测。
English: The ODDIT approach uses a digital twin with machine learning models to proactively detect future out-of-distribution states in autonomous vessels, achieving high detection accuracy under simulated conditions.

Authors:Yizhe Zhang, Jianping Li, Xin Zhao, Fuxun Liang, Zhen Dong, Bisheng Yang
Title: ARMOR: Adaptive Meshing with Reinforcement Optimization for Real-time 3D Monitoring in Unexposed Scenes
Abstract:
Unexposed environments, such as lava tubes, mines, and tunnels, are among the most complex yet strategically significant domains for scientific exploration and infrastructure development. Accurate and real-time 3D meshing of these environments is essential for applications including automated structural assessment, robotic-assisted inspection, and safety monitoring. Implicit neural Signed Distance Fields (SDFs) have shown promising capabilities in online meshing; however, existing methods often suffer from large projection errors and rely on fixed reconstruction parameters, limiting their adaptability to complex and unstructured underground environments such as tunnels, caves, and lava tubes. To address these challenges, this paper proposes ARMOR, a scene-adaptive and reinforcement learning-based framework for real-time 3D meshing in unexposed environments. The proposed method was validated across more than 3,000 meters of underground environments, including engineered tunnels, natural caves, and lava tubes. Experimental results demonstrate that ARMOR achieves superior performance in real-time mesh reconstruction, reducing geometric error by 3.96% compared to state-of-the-art baselines, while maintaining real-time efficiency. The method exhibits improved robustness, accuracy, and adaptability, indicating its potential for advanced 3D monitoring and mapping in challenging unexposed scenarios. The project page can be found at: https://yizhezhang0418.github.io/armor.github.io/
中文: 本文提出ARMOR框架,通过强化学习实现隧道、洞穴等未暴露环境的实时三维网格重建,相比现有方法几何误差降低3.96%的同时保持实时效率。
English: This paper introduces ARMOR, a reinforcement learning-based framework that enables real-time 3D meshing in unexposed environments like tunnels and caves, achieving a 3.96% reduction in geometric error while maintaining efficiency compared to existing methods.

Authors:Han Zhang, Hao Zhou, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Yigit Ozcan, Melike Erol-Kantarci
Title: Intelligent Attacks and Defense Methods in Federated Learning-enabled Energy-Efficient Wireless Networks
Abstract:
Federated learning (FL) is a promising technique for learning-based functions in wireless networks, thanks to its distributed implementation capability. On the other hand, distributed learning may increase the risk of exposure to malicious attacks where attacks on a local model may spread to other models by parameter exchange. Meanwhile, such attacks can be hard to detect due to the dynamic wireless environment, especially considering local models can be heterogeneous with non-independent and identically distributed (non-IID) data. Therefore, it is critical to evaluate the effect of malicious attacks and develop advanced defense techniques for FL-enabled wireless networks. In this work, we introduce a federated deep reinforcement learning-based cell sleep control scenario that enhances the energy efficiency of the network. We propose multiple intelligent attacks targeting the learning-based approach and we propose defense methods to mitigate such attacks. In particular, we have designed two attack models, generative adversarial network (GAN)-enhanced model poisoning attack and regularization-based model poisoning attack. As a counteraction, we have proposed two defense schemes, autoencoder-based defense, and knowledge distillation (KD)-enabled defense. The autoencoder-based defense method leverages an autoencoder to identify the malicious participants and only aggregate the parameters of benign local models during the global aggregation, while KD-based defense protects the model from attacks by controlling the knowledge transferred between the global model and local models.
中文: 无线网络中的联邦学习通过小区休眠控制提升能效,但面临生成对抗网络增强型和基于正则化的模型中毒等智能攻击风险,采用自编码器和知识蒸馏防御方案来保障模型聚合与知识传递的安全。
English: Federated learning in wireless networks enhances energy efficiency through cell sleep control but faces risks from intelligent attacks like GAN-enhanced and regularization-based model poisoning, countered by autoencoder and knowledge distillation defense methods to secure model aggregation and knowledge transfer.
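
The autoencoder-based defense described above filters out suspected malicious participants before global aggregation. A minimal sketch of that filtering step follows, assuming each local model already has a reconstruction error produced by some autoencoder (not shown here). The function name `aggregate_benign`, the `max_error` threshold, and the plain averaging rule are illustrative assumptions, not the authors' implementation.

```python
def aggregate_benign(local_params, recon_errors, max_error):
    """Average only the local models whose reconstruction error is acceptable.

    local_params: list of flat parameter vectors (lists of floats), one per participant.
    recon_errors: hypothetical per-participant autoencoder reconstruction errors.
    max_error:    assumed threshold above which a participant is treated as malicious.
    """
    benign = [p for p, e in zip(local_params, recon_errors) if e <= max_error]
    if not benign:
        raise ValueError("no benign local models to aggregate")
    n = len(benign)
    # Element-wise mean over the remaining (benign) parameter vectors.
    return [sum(values) / n for values in zip(*benign)]


params = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
errors = [0.1, 0.2, 5.0]  # the third participant reconstructs poorly
global_params = aggregate_benign(params, errors, max_error=1.0)
```

In this toy run the third participant is excluded, so the global parameters are the mean of the first two vectors. How the threshold is chosen is left open in this sketch.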

Authors:Alina Ene, Alessandro Epasto, Vahab Mirrokni, Hoai-An Nguyen, Huy L. Nguyen, David P. Woodruff, Peilin Zhong
Title: Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures
Abstract:
In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates that insert or delete an item from a subset come one-by-one. Notably, our algorithm only uses $\mathrm{poly}\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to $210\times$ over prior work.
中文: 本文首次提出了在旋转门流模型中解决最大覆盖问题的算法,实现了多对数更新时间的效率,并针对数据集中的重识别风险,开发了高效的靶向和通用指纹识别流算法。
English: This paper introduces the first algorithm for solving the maximum coverage problem in the turnstile streaming model, achieving poly-logarithmic update time, and also presents efficient streaming algorithms for targeted and general fingerprinting to assess re-identification risks in datasets.
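
To make the problem statement concrete, here is a minimal sketch of the classical offline greedy heuristic for maximum coverage, which is known to achieve a $(1 - 1/e)$ approximation. It is not the turnstile streaming algorithm from the paper; it assumes all subsets are available in memory, and the name `greedy_max_coverage` is chosen only for illustration.

```python
def greedy_max_coverage(subsets, k):
    """Pick up to k subsets, each time taking the one that covers the most new items.

    subsets: list of iterables of hashable items.
    Returns (indices of chosen subsets, set of covered items).
    """
    covered = set()
    chosen = []
    remaining = {i: set(s) for i, s in enumerate(subsets)}
    for _ in range(k):
        best_idx, best_gain = None, 0
        for idx, items in remaining.items():
            gain = len(items - covered)  # items this subset would newly cover
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx is None:  # nothing left adds coverage
            break
        chosen.append(best_idx)
        covered |= remaining.pop(best_idx)
    return chosen, covered


chosen, covered = greedy_max_coverage([{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}], k=2)
```

With these inputs the heuristic first takes the four-item subset and then the subset adding three new items, covering all seven items. The streaming setting is harder because insertions and deletions arrive one at a time and the subsets cannot be stored explicitly.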

Authors:Zezhou Chen, Zhaoxiang Liu, Kai Wang, Kohou Wang, Shiguo Lian
Title: A Large Vision-Language Model based Environment Perception System for Visually Impaired People
Abstract:
It is a challenging task for visually impaired people to perceive their surrounding environment due to the complexity of natural scenes. Their personal and social activities are thus highly limited. This paper introduces a Large Vision-Language Model (LVLM) based environment perception system which helps them to better understand the surrounding environment, by capturing the current scene they face with a wearable device, and then letting them retrieve the analysis results through the device. Visually impaired people can acquire a global description of the scene by long pressing the screen to activate the LVLM output, retrieve the categories of the objects in the scene resulting from a segmentation model by tapping or swiping the screen, and get a detailed description of the objects they are interested in by double-tapping the screen. To help visually impaired people more accurately perceive the world, this paper proposes incorporating the segmentation result of the RGB image as external knowledge into the input of the LVLM to reduce the LVLM's hallucination. Technical experiments on POPE, MME and LLaVA-QA90 show that the system could provide a more accurate description of the scene compared to Qwen-VL-Chat; exploratory experiments show that the system helps visually impaired people to perceive the surrounding environment effectively.
中文摘要:本文提出了一种基于大视觉语言模型的可穿戴环境感知系统,通过捕捉场景和交互检索帮助视障人士理解周围环境,并结合图像分割结果作为外部知识输入以减少模型幻觉,提高描述准确性。
English Summary: This paper presents a wearable system using a Large Vision-Language Model to help visually impaired people understand their environment through scene capture and interactive retrieval, with enhanced accuracy by integrating segmentation results to reduce model hallucination.

Authors:Keyang Ye, Tianjia Shao, Kun Zhou
Title: When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering
Abstract:
We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes -- surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation on each pixel order independently. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state-of-the-arts as a compelling representation for ultra-fast high-fidelity radiance field rendering.
中文摘要:高斯增强表面元(GESs)提出了一种双尺度表示法,结合二维表面元处理粗略几何与三维高斯分布补充细节,通过无排序渲染实现了超快速的高保真辐射场重建,并保持视角一致性。
English Summary: Gaussian-enhanced Surfels (GESs) introduce a bi-scale representation combining 2D surfels for coarse geometry and 3D Gaussians for fine details, achieving ultra-fast, high-fidelity radiance field rendering through sorting-free processing and view-consistent results.

Authors:Min Wei, Chaohui Yu, Jingkai Zhou, Fan Wang
Title: 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
Abstract:
Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at https://2y7c3.github.io/3DV-TON/
中文: 该研究提出3DV-TON框架,通过可动画3D网格生成高保真且时序一致的视频试穿效果,其性能优于现有方法,并配有新的基准数据集支持。
English: The study introduces 3DV-TON, a diffusion-based framework that uses animatable 3D meshes to produce high-fidelity and temporally consistent video try-on results, outperforming existing methods and supported by a new benchmark dataset.

Authors:Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Title: TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Abstract:
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) while maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
中文: TimeChat-Online通过受人类视觉启发的差分令牌丢弃模块,在保持98%性能的同时将流媒体视频的视觉冗余减少82.8%,其主动响应能力实现了高效的实时视频交互。
English: TimeChat-Online introduces a Differential Token Drop module inspired by human visual perception to reduce visual redundancy in streaming videos by 82.8% while maintaining 98% performance, enabling efficient real-time video interaction through its Proactive Response capability.
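
The idea of keeping only tokens that change between frames can be sketched as follows. This is a simplified illustration under stated assumptions, not the DTD module itself: frames are represented as lists of patch feature vectors, each patch is compared with the same position in the previous frame using Euclidean distance, and the `threshold` value and function name `differential_token_drop` are hypothetical.

```python
import math


def differential_token_drop(frames, threshold):
    """Return (frame_index, patch_index) pairs for the tokens that are kept.

    frames:    list of frames; each frame is a list of patch feature vectors.
    threshold: assumed minimum change for a patch to count as non-redundant.
    The first frame is kept in full; later patches are kept only if they
    differ enough from the same patch position in the previous frame.
    """
    kept = []
    prev = None
    for t, frame in enumerate(frames):
        for p, vec in enumerate(frame):
            if prev is None or math.dist(vec, prev[p]) > threshold:
                kept.append((t, p))
        prev = frame
    return kept


frames = [
    [[0.0, 0.0], [1.0, 1.0]],  # frame 0: kept in full
    [[0.0, 0.0], [1.0, 2.0]],  # frame 1: only the second patch changed
    [[0.0, 0.0], [1.0, 2.0]],  # frame 2: static, fully dropped
]
kept = differential_token_drop(frames, threshold=0.5)
drop_ratio = 1 - len(kept) / sum(len(f) for f in frames)
```

In this toy example three of six tokens survive, a 50% drop. The sketch uses no language input, in line with the abstract's point that redundancy can be detected without language guidance.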

Authors:Süleyman Özdel, Kadir Burak Buldu, Enkelejda Kasneci, Efe Bozkir
Title: Exploring Context-aware and LLM-driven Locomotion for Immersive Virtual Reality
Abstract:
Locomotion plays a crucial role in shaping the user experience within virtual reality environments. In particular, hands-free locomotion offers a valuable alternative by supporting accessibility and freeing users from reliance on handheld controllers. However, traditional speech-based methods often depend on rigid command sets, limiting the naturalness and flexibility of interaction. In this study, we propose a novel locomotion technique powered by large language models (LLMs), which allows users to navigate virtual environments using natural language with contextual awareness. We evaluate three locomotion methods: controller-based teleportation, voice-based steering, and our language model-driven approach. Our evaluation measures include eye-tracking data analysis, including explainable machine learning through SHAP analysis, as well as standardized questionnaires for usability, presence, cybersickness, and cognitive load to examine user attention and engagement. Our findings indicate that the LLM-driven locomotion possesses comparable usability, presence, and cybersickness scores to established methods like teleportation, demonstrating its novel potential as a comfortable, natural language-based, hands-free alternative. In addition, it enhances user attention within the virtual environment, suggesting greater engagement. Complementary to these findings, SHAP analysis revealed that fixation, saccade, and pupil-related features vary across techniques, indicating distinct patterns of visual attention and cognitive processing. Overall, we state that our method can facilitate hands-free locomotion in virtual spaces, especially in supporting accessibility.
中文: 本研究提出了一种基于大语言模型的新型虚拟现实无手柄移动技术,通过自然语言导航实现,与传统方法相比具有相当的可用性并提升了用户参与度。
English: This study introduces a novel hands-free locomotion technique for virtual reality using large language models, which enables natural language navigation and demonstrates comparable usability and enhanced user engagement compared to traditional methods.

Authors:Yufeng Chi, Qiayuan Liao, Junfeng Long, Xiaoyu Huang, Sophia Shao, Borivoje Nikolic, Zhongyu Li, Koushil Sreenath
Title: Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot
Abstract:
Despite significant interest and advancements in humanoid robotics, most existing commercially available hardware remains high-cost, closed-source, and non-transparent within the robotics community. This lack of accessibility and customization hinders the growth of the field and the broader development of humanoid technologies. To address these challenges and promote democratization in humanoid robotics, we demonstrate Berkeley Humanoid Lite, an open-source humanoid robot designed to be accessible, customizable, and beneficial for the entire community. The core of this design is a modular 3D-printed gearbox for the actuators and robot body. All components can be sourced from widely available e-commerce platforms and fabricated using standard desktop 3D printers, keeping the total hardware cost under $5,000 (based on U.S. market prices). The design emphasizes modularity and ease of fabrication. To address the inherent limitations of 3D-printed gearboxes, such as reduced strength and durability compared to metal alternatives, we adopted a cycloidal gear design, which provides an optimal form factor in this context. Extensive testing was conducted on the 3D-printed actuators to validate their durability and alleviate concerns about the reliability of plastic components. To demonstrate the capabilities of Berkeley Humanoid Lite, we conducted a series of experiments, including the development of a locomotion controller using reinforcement learning. These experiments successfully showcased zero-shot policy transfer from simulation to hardware, highlighting the platform's suitability for research validation. By fully open-sourcing the hardware design, embedded code, and training and deployment frameworks, we aim for Berkeley Humanoid Lite to serve as a pivotal step toward democratizing the development of humanoid robotics. All resources are available at https://lite.berkeley-humanoid.org.
中文: Berkeley Humanoid Lite是一款开源低成本人形机器人,采用模块化3D打印组件设计,旨在提升机器人研究的可及性和可定制性,并通过成功的仿真到硬件策略迁移验证了其性能。
English: The Berkeley Humanoid Lite is an open-source, low-cost humanoid robot designed with modular 3D-printed components to enhance accessibility and customization in robotics research, validated through successful simulation-to-hardware policy transfers.

Authors:Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
Title: Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Abstract:
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
中文摘要:本研究批评当前视觉语言模型在指代表达生成中忽视语用维度,通过新数据集和评估揭示了模型在指称唯一性、信息相关性及与人类交流偏好对齐方面的不足,呼吁建立符合实际沟通的语用评估框架。
English Summary: This study critiques current vision-language models for overlooking pragmatic aspects in Referring Expression Generation, proposing a new dataset and evaluation that highlight failures in referential uniqueness, information relevance, and alignment with human communication patterns.

Authors:Hailan Yang, Zhenyu Qi, Shuchang Liu, Xiaoyu Yang, Xiaobei Wang, Xiang Li, Lantao Hu, Han Li, Kun Gai
Title: Comprehensive List Generation for Multi-Generator Reranking
Abstract:
Reranking models solve the final recommendation lists that best fulfill users' demands. While existing solutions focus on finding parametric models that approximate optimal policies, recent approaches find that it is better to generate multiple lists to compete for a ``pass'' ticket from an evaluator, where the evaluator serves as the supervisor who accurately estimates the performance of the candidate lists. In this work, we show that we can achieve a more efficient and effective list proposal with a multi-generator framework and provide empirical evidence on two public datasets and online A/B tests. More importantly, we verify that the effectiveness of a generator is closely related to how much it complements the views of other generators with sufficiently different rerankings, which derives the metric of list comprehensiveness. With this intuition, we design an automatic complementary generator-finding framework that learns a policy that simultaneously aligns the users' preferences and maximizes the list comprehensiveness metric. The experimental results indicate that the proposed framework can further improve the multi-generator reranking performance.
中文摘要:本研究提出一种多生成器重排序框架,通过生成多样候选列表并利用互补性优化列表全面性,从而提升推荐系统的效果与效率。
English Summary: This study introduces a multi-generator reranking framework that enhances recommendation efficiency by generating diverse candidate lists and optimizing their comprehensiveness through complementary generator selection.

Authors:Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li
Title: StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
Abstract:
3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on the NeRF synthetic dataset (objects) and the tandt db dataset (scenes), StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.
中文摘要:3D高斯泼溅技术虽能实现逼真场景重建,却在风格化场景中表现不佳;StyleMe3D框架通过多模态风格调节和多层级语义对齐,在保持几何完整性的同时实现连贯的三维风格迁移,弥合了写实与艺术化表达之间的鸿沟。
English Summary: 3D Gaussian Splatting achieves photorealistic reconstruction but falters with stylized scenes, prompting StyleMe3D—a comprehensive framework using multi-modal conditioning and semantic alignment to enable coherent 3D style transfer while preserving geometry and real-time performance.

Authors:Hongbin Xu, Chaohui Yu, Feng Xiao, Jiazheng Xing, Hai Ci, Weitao Chen, Fan Wang, Ming Li
Title: Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization
Abstract:
Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions like edge and depth, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose Cyc3D, a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated based on extracted signals from the first-order generation, and its original input controls. Specifically, we employ an efficient feed-forward backbone that can generate a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, from which the extracted condition is used to regenerate the 3D content. This re-generated output is then rendered back to the initial viewpoint, followed by another round of control signal extraction, forming a cyclic process with two consistency constraints. \emph{View consistency} ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. \emph{Condition consistency} aligns the final extracted signal with the original input control, preserving structural or geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that Cyc3D significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17\% PSNR for edge, +6.26\% PSNR for sketch).
中文摘要:提出的 Cyc3D 框架通过强制再生3D内容与原始输入控制之间的循环一致性,显著提升了在各种条件下生成内容与输入控制的匹配精度。
English Summary: The proposed Cyc3D framework enhances controllable 3D generation by enforcing cyclic consistency between regenerated 3D content and original input controls, significantly improving alignment accuracy across various conditions.

Authors:Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao
Title: ViMo: A Generative Visual GUI World Model for App Agents
Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.
中文摘要:ViMo是一种创新的视觉世界模型,通过分离图形与文本生成未来应用界面图像,使应用代理能通过更优规划做出更明智决策。
English Summary: ViMo is a novel visual world model that generates future app interface images by separating graphics and text, enabling app agents to make better decisions through improved planning.

Authors:Zhen Wen, Luoxuan Weng, Yinghao Tang, Runjin Zhang, Yuxin Liu, Bo Pan, Minfeng Zhu, Wei Chen
Title: Exploring Multimodal Prompt for Visualization Authoring with Large Language Models
Abstract:
Recent advances in large language models (LLMs) have shown great potential in automating the process of visualization authoring through simple natural language utterances. However, instructing LLMs using natural language is limited in precision and expressiveness for conveying visualization intent, leading to misinterpretation and time-consuming iterations. To address these limitations, we conduct an empirical study to understand how LLMs interpret ambiguous or incomplete text prompts in the context of visualization authoring, and the conditions making LLMs misinterpret user intent. Informed by the findings, we introduce visual prompts as a complementary input modality to text prompts, which help clarify user intent and improve LLMs' interpretation abilities. To explore the potential of multimodal prompting in visualization authoring, we design VisPilot, which enables users to easily create visualizations using multimodal prompts, including text, sketches, and direct manipulations on existing visualizations. Through two case studies and a controlled user study, we demonstrate that VisPilot provides a more intuitive way to create visualizations without affecting the overall task efficiency compared to text-only prompting approaches. Furthermore, we analyze the impact of text and visual prompts in different visualization tasks. Our findings highlight the importance of multimodal prompting in improving the usability of LLMs for visualization authoring. We discuss design implications for future visualization systems and provide insights into how multimodal prompts can enhance human-AI collaboration in creative visualization tasks. All materials are available at https://OSF.IO/2QRAK.
中文: 大型语言模型在通过自然语言自动化创建可视化方面展现出潜力,但存在精度和表达力不足的问题,易导致误解和耗时迭代;为此,研究引入视觉提示作为补充输入方式,设计VisPilot系统支持多模态提示,实现更直观的可视化创作,在不影响效率的同时提升可用性。
English: Recent advances in large language models (LLMs) show potential for automating visualization authoring through natural language, but face limitations in precision and expressiveness, leading to misinterpretations and time-consuming iterations; to address this, the study introduces visual prompts as a complementary input modality, designing VisPilot to enable intuitive multimodal prompting that improves usability without sacrificing efficiency.

Authors:Yan Yang, Yixia Li, Hongru Wang, Xuetao Wei, Jianqiao Yu, Yun Chen, Guanhua Chen
Title: ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Abstract:
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
中文: ImPart提出了一种基于重要性的增量稀疏化方法,通过奇异值分解动态调整不同奇异向量的稀疏率,在保持关键任务知识的同时实现2倍压缩比提升,并在增量量化和模型融合领域创下新纪录。
English: ImPart introduces an importance-aware delta sparsification method that dynamically adjusts sparsity ratios using SVD to preserve critical task-specific knowledge, achieving a 2× higher compression ratio and setting new benchmarks in delta quantization and model merging.
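The idea of assigning different sparsity ratios to different singular vectors can be illustrated with a minimal sketch. This is not the ImPart algorithm: the importance proxy (each singular value relative to the largest), the linear keep-ratio schedule, the `base_keep` parameter, and the function name `importance_aware_sparsify` are all assumptions made for illustration. The random masking with rescaling by `1 / keep` is borrowed from random-drop style sparsification so that the reconstruction is unbiased in expectation.

```python
import numpy as np

def importance_aware_sparsify(delta, base_keep=0.25, seed=0):
    """Sparsify a delta weight matrix in SVD space (illustrative sketch).

    Assumption: a singular vector's importance is its singular value divided
    by the largest singular value; its keep ratio grows linearly from
    `base_keep` (least important) up to 1.0 (most important).
    """
    rng = np.random.default_rng(seed)
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)

    importance = S / S.max()  # in [0, 1], sorted in descending order
    keep = np.clip(base_keep + (1.0 - base_keep) * importance, 0.05, 1.0)

    # Bernoulli mask per entry; column k of U and row k of Vt use keep[k].
    u_mask = rng.random(U.shape) < keep[None, :]
    v_mask = rng.random(Vt.shape) < keep[:, None]

    # Rescale surviving entries so each factor is unbiased in expectation.
    U_sparse = np.where(u_mask, U / keep[None, :], 0.0)
    Vt_sparse = np.where(v_mask, Vt / keep[:, None], 0.0)
    return U_sparse, S, Vt_sparse, keep

def reconstruct(U_sparse, S, Vt_sparse):
    """Rebuild an approximate delta from the sparsified factors."""
    return (U_sparse * S) @ Vt_sparse
```

With `base_keep=1.0` nothing is dropped and `reconstruct` returns the original delta; lowering `base_keep` sparsifies the less important singular vectors more aggressively while leaving the top one dense.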

Authors:Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, Jiangmiao Pang
Title: Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation
Abstract:
Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from imprecise physical simulation caused by inaccurate geometric reconstruction. This paper introduces RoboSplat, a novel method that generates diverse, visually realistic demonstrations by directly manipulating 3D Gaussians. Specifically, we reconstruct the scene through 3D Gaussian Splatting (3DGS), directly edit the reconstructed scene, and augment data across six types of generalization with five techniques: 3D Gaussian replacement for varying object types, scene appearance, and robot embodiments; equivariant transformations for different object poses; visual attribute editing for various lighting conditions; novel view synthesis for new camera perspectives; and 3D content generation for diverse object types. Comprehensive real-world experiments demonstrate that RoboSplat significantly enhances the generalization of visuomotor policies under diverse disturbances. Notably, while policies trained on hundreds of real-world demonstrations with additional 2D data augmentation achieve an average success rate of 57.2%, RoboSplat attains 87.8% in one-shot settings across six types of generalization in the real world.
Chinese: RoboSplat通过3D高斯泼溅技术生成多样且逼真的演示数据,有效解决了视觉运动策略训练中的数据限制问题,在真实世界的一次性设置中实现了87.8%的成功率,显著提升了策略的泛化能力。
English: RoboSplat addresses the limitations of visuomotor policy training by using 3D Gaussian Splatting to generate diverse and realistic demonstrations, significantly improving generalization and achieving an 87.8% success rate in real-world one-shot settings.

Authors:Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
Title: It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Abstract:
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity, or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices of: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall intensive tasks, even outperforming Transformers and other modern linear recurrent models.
Chinese: 本研究提出Miras框架,通过将现有神经网络重构为具有新型注意力偏置目标和保留门控的联想记忆模块,开发出Moneta等模型,在语言建模等专业任务中超越Transformer性能。
English: This research introduces Miras, a framework for designing neural architectures that reconceptualizes existing models as associative memory modules with novel attentional bias objectives and retention gates, yielding models like Moneta and Yaad that outperform Transformers in specialized tasks.

Authors:Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu
Title: Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge
Abstract:
Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named \textsc{Pandora}, which takes advantage of \textsc{Python}'s \textsc{Pandas} API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that \textsc{Pandora} outperforms existing unified frameworks and competes effectively with task-specific methods.
中文: 本文提出Pandora框架,利用Python的Pandas API构建统一知识表示以对齐大语言模型,通过生成文本推理步骤和可执行代码处理多源结构化知识问答,在实验中显著优于现有方法。
English: This paper introduces Pandora, a unified framework that leverages Python's Pandas API to align structured knowledge reasoning with LLMs, enabling natural language question answering across diverse sources through generated reasoning steps and executable code, outperforming existing methods in experiments.

Authors:Ching-Chun Chang, Isao Echizen
Title: The Chronicles of Foundation AI for Forensics of Multi-Agent Provenance
Abstract:
Provenance is the chronology of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi-agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi-agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta-information. At its core lies the notion of symbolic chronicles, representing signed and time-stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.
中文摘要:本研究提出一种基于符号编年史的时序系统,通过分析内容演变追踪多智能体溯源,无需依赖内部记忆状态或外部元数据即可实现可问责的协作式人工智能。
English Summary: This study proposes a chronological system using symbolic chronicles to track multi-agent provenance by analyzing content transformations, enabling accountable collaborative AI without relying on memory states or external metadata.

Authors:Tianhui Song, Weixin Feng, Shuai Wang, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang
Title: DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
Abstract:
The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges for high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects the features of T2I generation task that numerous distinct models cover sundry styles which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
中文摘要:提出的基于分数蒸馏的模型融合范式(DMM)有效将多个专业文本生成图像模型整合为单一多功能模型,在解决参数冗余和存储问题的同时实现了可控的任意风格生成。
English Summary: The proposed score distillation-based model merging paradigm (DMM) effectively consolidates multiple specialized text-to-image models into a single versatile model, enabling controllable arbitrary-style generation while addressing parameter redundancy and storage issues.

Authors:Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Title: AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era
Abstract:
Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored, despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in the anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found at https://flytweety.github.io/AnimeDL2M/.
中文摘要:针对扩散模型使动漫图像伪造检测日益困难的问题,本研究提出了首个大规模动漫专用基准数据集AnimeDL-2M和专门模型AniXplore,该模型在检测篡改动漫内容方面显著优于现有方法。
English Summary: Recent advances in diffusion models have made anime image forgery detection increasingly difficult, prompting the creation of AnimeDL-2M, the first large-scale anime-specific benchmark, and AniXplore, a novel model that outperforms existing methods in detecting manipulated anime content.

Authors:Mengdi Wang, Efe Bozkir, Enkelejda Kasneci
Title: Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study
Abstract:
Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all, in this paper, we benchmark blurring, noising, downsampling, rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, and reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate risk of attack. Our experiments show that canonical image processing methods like blurring and noising cause a marginal impact on deep learning-based tasks. While downsampling, rubber sheet model, and iris style transfer are effective in hiding user identifiers, iris style transfer, with higher computation cost, outperforms others in both utility tasks, and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations of those to reach an optimal privacy-utility trade-off.
中文: 近期AR/VR头显技术发展使其有望成为日常设备,但眼动追踪中的虹膜数据引发隐私担忧;研究通过系统评估多种模糊处理方法,发现虽无通用最优方案,但建议结合不同方法优势以实现隐私与实用性的最佳平衡。
English: Recent AR/VR headset advancements enable everyday use but raise privacy concerns due to iris data exposure, prompting a benchmark study of obfuscation methods that finds no universal solution but recommends tailored approaches for optimal privacy-utility balance.

Authors:Zuoli Tang, Junjie Ou, Kaiqin Hu, Chunwei Wu, Zhaoxin Huan, Chilin Fu, Xiaolu Zhang, Jun Zhou, Chenliang Li
Title: Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance
Abstract:
Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs' reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.
中文: 最新研究表明,尽管大型语言模型无需显式思维链提示也能有效推理,但在短路径提示下其推理能力会显著下降且不稳定,为此提出的指令引导和微调方法能有效解决这一冲突并保持高准确率。
English: Recent research reveals that while large language models can reason robustly without explicit chain-of-thought prompts, their performance significantly declines under short-path prompting, leading to proposed solutions through instruction guidance and fine-tuning that maintain high accuracy.

Authors:Yexing Xu, Longguang Wang, Minglin Chen, Sheng Ao, Li Li, Yulan Guo
Title: DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering
Abstract:
Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU. The project page is at: https://xuyx55.github.io/DropoutGS/.
Chinese: 本文提出DropoutGS方法,通过随机丢弃正则化减少过拟合和边缘引导分裂策略保留细节,显著提升了稀疏输入下3D高斯泼溅的性能,在多个基准数据集上达到最优效果。
English: This paper introduces DropoutGS, a method that enhances 3D Gaussian Splatting by incorporating Random Dropout Regularization to reduce overfitting and an Edge-guided Splitting Strategy to preserve details, achieving state-of-the-art performance with sparse inputs.

Authors:Qisai Liu, Zhanhong Jiang, Joshua R. Waite, Chao Liu, Aditya Balu, Soumik Sarkar
Title: Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion
Abstract:
Sequence modeling is a critical yet challenging task with wide-ranging applications, especially in time series forecasting for domains like weather prediction, temperature monitoring, and energy load forecasting. Transformers, with their attention mechanism, have emerged as state-of-the-art due to their efficient parallel training, but they suffer from quadratic time complexity, limiting their scalability for long sequences. In contrast, recurrent neural networks (RNNs) offer linear time complexity, spurring renewed interest in linear RNNs for more computationally efficient sequence modeling. In this work, we introduce BLUR (Bidirectional Linear Unit for Recurrent network), which uses forward and backward linear recurrent units (LRUs) to capture both past and future dependencies with high computational efficiency. BLUR maintains the linear time complexity of traditional RNNs, while enabling fast parallel training through LRUs. Furthermore, it offers provably stable training and strong approximation capabilities, making it highly effective for modeling long-term dependencies. Extensive experiments on sequential image and time series datasets reveal that BLUR not only surpasses transformers and traditional RNNs in accuracy but also significantly reduces computational costs, making it particularly suitable for real-world forecasting tasks. Our code is available here.
中文: BLUR提出了一种双向线性循环网络,通过线性时间复杂度高效捕捉过去与未来的依赖关系,在精度上超越了Transformer和传统循环神经网络,同时显著降低了实际预测任务中的计算成本。
English: BLUR introduces a bidirectional linear recurrent network that efficiently captures past and future dependencies with linear time complexity, outperforming transformers and traditional RNNs in accuracy while reducing computational costs for real-world forecasting tasks.
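The notion of capturing past and future dependencies with forward and backward linear recurrences can be sketched as follows. This is a minimal illustration, not the BLUR architecture: it uses a real-valued elementwise recurrence `h_t = a * h_{t-1} + b * x_t` with hand-picked scalars, a plain sequential loop instead of a parallel scan, and hypothetical function names; the actual LRU parameterization is not reproduced here.

```python
import numpy as np

def linear_recurrence(x, a, b):
    """Run h_t = a * h_{t-1} + b * x_t over a sequence of shape (T, D)."""
    h = np.zeros_like(x[0], dtype=float)
    states = []
    for x_t in x:
        h = a * h + b * x_t  # no nonlinearity: the update is linear in h
        states.append(h)
    return np.stack(states)

def bidirectional_linear_recurrence(x, a_fwd, b_fwd, a_bwd, b_bwd):
    """Concatenate a forward pass with a backward pass over the sequence."""
    fwd = linear_recurrence(x, a_fwd, b_fwd)
    # Reverse the time axis, run the same recurrence, then restore the order,
    # so that position t summarizes x_t, x_{t+1}, ..., x_{T-1}.
    bwd = linear_recurrence(x[::-1], a_bwd, b_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

if __name__ == "__main__":
    seq = np.array([[1.0], [2.0], [3.0]])
    print(bidirectional_linear_recurrence(seq, 0.5, 1.0, 0.5, 1.0))
```

Each output row holds a summary of the past in its first half and a summary of the future in its second half; the cost is linear in the sequence length because every step touches only the previous state.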

Authors:Yilin Ning, Yian Ma, Mingxuan Liu, Xin Li, Nan Liu
Title: seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness
Abstract:
Fairness in artificial intelligence (AI) prediction models is increasingly emphasized to support responsible adoption in high-stakes domains such as health care and criminal justice. Guidelines and implementation frameworks highlight the importance of both predictive accuracy and equitable outcomes. However, current fairness toolkits often evaluate classification performance disparities in isolation, with limited attention to other critical aspects such as calibration. To address these gaps, we present seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias offers an integrated evaluation across classification, calibration, and other performance domains, providing a more complete view of model behavior. It includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, we demonstrate how seeBias supports fairness evaluations, and uncovers disparities that conventional fairness metrics may overlook. The R package is available on GitHub, and a Python version is under development.
中文: seeBias R软件包通过提供涵盖分类、校准和性能领域的综合评估及可定制可视化,弥补了当前AI公平性工具包的不足,在刑事司法和医疗应用中揭示了传统指标可能忽略的差异。
English: The seeBias R package addresses gaps in current AI fairness toolkits by providing comprehensive evaluation across classification, calibration, and performance domains with customizable visualizations, revealing disparities overlooked by conventional metrics in criminal justice and healthcare applications.

Authors:Romain de Laage, Peterson Yuhala, François-Xavier Wicht, Pascal Felber, Christian Cachin, Valerio Schiavoni
Title: Practical Secure Aggregation by Combining Cryptography and Trusted Execution Environments
Abstract:
Secure aggregation enables a group of mutually distrustful parties, each holding private inputs, to collaboratively compute an aggregate value while preserving the privacy of their individual inputs. However, a major challenge in adopting secure aggregation approaches for practical applications is the significant computational overhead of the underlying cryptographic protocols, e.g. fully homomorphic encryption. This overhead makes secure aggregation protocols impractical, especially for large datasets. In contrast, hardware-based security techniques such as trusted execution environments (TEEs) enable computation at near-native speeds, making them a promising alternative for reducing the computational burden typically associated with purely cryptographic techniques. Yet, in many scenarios, parties may opt for either cryptographic or hardware-based security mechanisms, highlighting the need for hybrid approaches. In this work, we introduce several secure aggregation architectures that integrate both cryptographic and TEE-based techniques, analyzing the trade-offs between security and performance.
Chinese: 本文提出融合密码学方法与可信执行环境的混合安全聚合架构,在解决纯密码学方法计算效率低下的同时,兼顾安全性与性能的平衡。
English: This paper introduces hybrid secure aggregation architectures that combine cryptographic methods with trusted execution environments to balance security and performance, addressing the computational inefficiency of purely cryptographic approaches.

Authors:Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye
Title: Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception
Abstract:
The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL frontend by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets. The code is available online.
中文: 本文首次建立了涵盖语音、声音、歌声和音乐的全面音频深度伪造检测基准,提出了一种参数高效的小波提示调优方法,在跨类型检测中实现了仅3.58%的平均等错误率最优性能。
English: This paper establishes the first comprehensive benchmark for all-type audio deepfake detection across speech, sound, singing voice, and music, introducing a parameter-efficient wavelet prompt tuning method that achieves state-of-the-art performance with just 3.58% average equal error rate.
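The equal error rate (EER) reported above is the operating point where the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be estimated from detector scores follows; it assumes higher scores mean "bona fide", and the function name and the threshold sweep over observed scores are illustrative choices, not the paper's evaluation code.

```python
from typing import Sequence


def equal_error_rate(bonafide_scores: Sequence[float], spoof_scores: Sequence[float]) -> float:
    """Estimate the EER by sweeping thresholds over all observed scores.

    A sample is accepted as bona fide when its score is >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    print(equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2]))
```

With finitely many scores the two rates rarely match exactly, so this sketch averages them at the closest threshold; production toolkits typically interpolate instead.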

Authors:Yifan Gao, Zihang Lin, Chuanbin Liu, Min Zhou, Tiezheng Ge, Bo Zheng, Hongtao Xie
Title: PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
Abstract:
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, but the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as the control signal and develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.
中文摘要:产品海报需精确文本渲染和主体保真度,本文提出的PosterMaker框架通过TextRenderNet实现超90%文本准确率,结合SceneGenNet与保真度反馈机制,采用两阶段训练策略显著超越现有基线。
English Summary: Product posters require precise text rendering and subject fidelity, which are addressed by the proposed PosterMaker framework using TextRenderNet for over 90% text accuracy and SceneGenNet with fidelity feedback, outperforming existing methods through a two-stage training strategy.

Authors:Huaguan Chen, Yang Liu, Hao Sun
Title: PINP: Physics-Informed Neural Predictor with latent estimation of fluid flows
Abstract:
Accurately predicting fluid dynamics and evolution has been a long-standing challenge in physical sciences. Conventional deep learning methods often rely on the nonlinear modeling capabilities of neural networks to establish mappings between past and future states, overlooking the fluid dynamics, or only modeling the velocity field, neglecting the coupling of multiple physical quantities. In this paper, we propose a new physics-informed learning approach that incorporates coupled physical quantities into the prediction process to assist with forecasting. Central to our method is the discretization of physical equations, which are directly integrated into the model architecture and loss function. This integration enables the model to provide robust, long-term future predictions. By incorporating physical equations, our model demonstrates temporal extrapolation and spatial generalization capabilities. Experimental results show that our approach achieves state-of-the-art performance in spatiotemporal prediction across both numerical simulations and real-world extreme-precipitation nowcasting benchmarks.
Chinese: 本文提出了一种物理信息学习方法,将离散化的物理方程融入模型以改进流体动力学预测,在时空预测任务中实现了最先进的性能。
English: This paper introduces a physics-informed learning method that integrates discretized physical equations into the model to enhance fluid dynamics prediction, achieving state-of-the-art performance in spatiotemporal forecasting tasks.

Authors:César Leblanc, Lukas Picek, Benjamin Deneu, Pierre Bonnet, Maximilien Servajean, Rémi Palard, Alexis Joly
Title: Mapping biodiversity at very-high resolution in Europe
Abstract:
This paper describes a cascading multimodal pipeline for high-resolution biodiversity mapping across Europe, integrating species distribution modeling, biodiversity indicators, and habitat classification. The proposed pipeline first predicts species compositions using a deep-SDM, a multimodal model trained on remote sensing, climate time series, and species occurrence data at 50x50m resolution. These predictions are then used to generate biodiversity indicator maps and classify habitats with Pl@ntBERT, a transformer-based LLM designed for species-to-habitat mapping. With this approach, continental-scale species distribution maps, biodiversity indicator maps, and habitat maps are produced, providing fine-grained ecological insights. Unlike traditional methods, this framework enables joint modeling of interspecies dependencies, bias-aware training with heterogeneous presence-absence data, and large-scale inference from multi-source remote sensing inputs.
中文: 本文介绍了一种用于欧洲高分辨率生物多样性制图的级联多模态流程,该流程整合了物种分布建模、生物多样性指标和栖息地分类,可在洲际尺度上生成精细的生态洞察。
English: This paper presents a cascading multimodal pipeline for high-resolution biodiversity mapping in Europe, integrating species distribution modeling, biodiversity indicators, and habitat classification to produce detailed ecological insights at a continental scale.

Authors:Sugyeong Eo, Hyeonseok Moon, Evelyn Hayoon Zi, Chanjun Park, Heuiseok Lim
Title: Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning
Abstract:
Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs). Despite improvements in reasoning, the approach introduces substantial computational overhead resulting from iterative agent interactions. Furthermore, engaging in unnecessary debates increases the risk of generating erroneous responses. To address these challenges, we propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates debate based on the confidence score of the agent's initial response. Debate is activated only for queries requiring further deliberation, during which agents refine their outputs by referencing peer responses and associated confidence scores. Evaluations on benchmarks show that DOWN improves efficiency by up to six times while preserving or even outperforming the performance of existing methods. Further analysis indicates that DOWN effectively mitigates the risk of error propagation stemming from the unnecessary debate process. These findings demonstrate the effectiveness of our approach in delivering high-performance LLM solutions at a lower computational cost.
中文: 提出的DOWN框架根据置信度选择性启动多智能体辩论,通过减少不必要的讨论和错误传播风险,在保持或超越现有方法性能的同时实现了高达六倍的效率提升。
English: The proposed DOWN framework selectively activates multiagent debates based on confidence scores, achieving up to sixfold efficiency gains while maintaining or surpassing existing methods' performance by reducing unnecessary discussions and error risks.
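The confidence-gated control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: agents are plain callables returning an (answer, confidence) pair, and the threshold value, round count, and confidence-weighted vote are hypothetical choices, not details taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

Response = Tuple[str, float]  # (answer text, confidence in [0, 1])
Agent = Callable[[str, List[Response]], Response]  # (query, peer responses) -> response


def answer_with_optional_debate(
    query: str, agents: List[Agent], threshold: float = 0.8, max_rounds: int = 2
) -> Tuple[str, bool]:
    """Return (final answer, whether a debate was triggered)."""
    first = agents[0](query, [])
    if first[1] >= threshold:
        # Confident initial answer: skip the debate and its extra model calls.
        return first[0], False

    # Low confidence: collect initial answers from the remaining agents.
    responses = [first] + [agent(query, []) for agent in agents[1:]]
    for _ in range(max_rounds):
        # Each agent refines its answer after seeing peer answers and confidences.
        responses = [agent(query, responses) for agent in agents]

    # Aggregate with a confidence-weighted vote (an assumed aggregation rule).
    votes: Counter = Counter()
    for answer, confidence in responses:
        votes[answer] += confidence
    return votes.most_common(1)[0][0], True


if __name__ == "__main__":
    confident = lambda q, peers: ("4", 0.95)
    print(answer_with_optional_debate("2 + 2 = ?", [confident]))
```

The efficiency gain in this sketch comes from the early return: queries answered confidently cost one agent call instead of a full multi-round exchange.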

Authors:Chengjie Lu, Pablo Valle, Jiahui Wu, Erblin Isaku, Hassan Sartaj, Aitor Arrieta, Shaukat Ali
Title: Foundation Models for Software Engineering of Cyber-Physical Systems: the Road Ahead
Abstract:
Foundation Models (FMs), particularly Large Language Models (LLMs), are increasingly used to support various software engineering activities (e.g., coding and testing). Their applications in the software engineering of Cyber-Physical Systems (CPSs) are also growing. However, research in this area remains limited. Moreover, existing studies have primarily focused on LLMs, which are only one type of FM, leaving ample opportunities to explore others, such as vision-language models. We argue that, in addition to LLMs, other FMs utilizing different data modalities (e.g., images, audio) and multimodal models (which integrate multiple modalities) hold great potential for supporting CPS software engineering, given that these systems process diverse data types. To address this, we present a research roadmap for integrating FMs into various phases of CPS software engineering, highlighting key research opportunities and challenges for the software engineering community. Moreover, we discuss the common challenges associated with applying FMs in this context, including the correctness of FM-generated artifacts, as well as the inherent uncertainty and hallucination associated with FMs. This roadmap is intended for researchers and practitioners in CPS software engineering, providing future research directions using FMs in this domain.
中文: 本文摘要指出基础模型在信息物理系统软件工程中的应用日益增多但研究有限,提出了整合视觉语言和多模态等多样化模型的研究路线图,以应对生成结果正确性和模型不确定性等挑战。
English: This abstract highlights the growing but limited use of foundation models, especially large language models, in cyber-physical systems software engineering and proposes a research roadmap to explore diverse models like vision-language and multimodal types, addressing challenges such as correctness and uncertainty.

Authors:Alessio Bucaioni, Martin Weyssow, Junda He, Yunbo Lyu, David Lo
Title: Artificial Intelligence for Software Architecture: Literature Review and the Road Ahead
Abstract:
This paper presents a forward-looking vision for artificial intelligence-driven software architecture that addresses longstanding challenges in design and evolution. Although artificial intelligence has achieved notable success in software engineering, its explicit application to software architecture remains under-explored. Traditional practices, heavily reliant on expert knowledge and complex trade-off reasoning, tend to be manual and error-prone, thereby compromising system quality and maintainability. Building on recent advances, we examine how artificial intelligence can automate architectural design, support quantitative trade-off analyses, and continuously update architectural documentation. Our approach combines a systematic review of state-of-the-art applications with insights from industry practitioners. The resulting roadmap outlines 14 current artificial intelligence contributions to software architecture, identifies six artificial intelligence-specific challenges in supporting architectural tasks, and reveals six avenues for future improvement, charting a course for future research and practical implementations.
中文: 本文提出了一种人工智能驱动的软件架构愿景,旨在自动化设计、优化权衡分析并维护文档,以解决当前不足并规划未来研究方向。
English: This paper proposes an AI-driven vision for software architecture to automate design, enhance trade-off analysis, and maintain documentation, addressing current gaps and outlining future research directions.

Authors:Akis Nousias, Efklidis Katsaros, Evangelos Syrmos, Panagiotis Radoglou-Grammatikis, Thomas Lagkas, Vasileios Argyriou, Ioannis Moscholios, Evangelos Markakis, Sotirios Goudos, Panagiotis Sarigiannidis
Title: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs
Abstract:
Malware detection is increasingly challenged by evolving techniques like obfuscation and polymorphism, limiting the effectiveness of traditional methods. Meanwhile, the widespread adoption of software containers has introduced new security challenges, including the growing threat of malicious software injection, where a container, once compromised, can serve as an entry point for further cyberattacks. In this work, we address these security issues by introducing a method to identify compromised containers through machine learning analysis of their file systems. We cast entire software containers into large RGB images via their tarball representations, and propose to use established Convolutional Neural Network architectures in a streaming, patch-based manner. To support our experiments, we release the COSOCO dataset, the first of its kind, containing 3364 large-scale RGB images of benign and compromised software containers at https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset. Our method detects more malware and achieves higher F1 and Recall scores than all individual VirusTotal engines and their ensembles, demonstrating its effectiveness and setting a new standard for identifying malware-compromised software containers.
中文摘要:本研究提出一种机器学习方法,将软件容器转换为RGB图像,并利用卷积神经网络检测受恶意软件感染的容器,其效果优于现有VirusTotal引擎。
English Summary: This study introduces a machine learning method that converts software containers into RGB images and uses convolutional neural networks to detect malware-compromised containers more effectively than existing VirusTotal engines.

Authors:Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda
Title: GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration
Abstract:
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTAQ is easy to implement, simply using 20 more lines of code than GPTQ but improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90% pretraining ImageNet accuracy. Code is available at GitHub.
中文摘要:GPTAQ是一种无需微调的新型量化方法,通过非对称校准和并行化技术有效减少累积量化误差,仅需比GPTQ多20行代码即可高效压缩405B等大型Transformer模型。
English Summary: GPTAQ is a novel finetuning-free quantization method that reduces accumulated quantization errors through asymmetric calibration and parallelization techniques, enabling efficient compression of large transformers like 405B models with minimal code additions.

Authors:Yiyang Shen, Kun Zhou, He Wang, Yin Yang, Tianjia Shao
Title: High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
Abstract:
Recently, single-view 3D generation via Gaussian splatting has emerged and developed quickly. These methods learn 3D Gaussians from 2D RGB images generated from pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation through a single image. Despite the current progress, these methods still suffer from the inconsistency jointly caused by the geometric ambiguity in the 2D images and the lack of structure of 3D Gaussians, leading to distorted and blurry 3D object generation. In this paper, we propose to fix these issues with GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is that a structured 3D representation can simultaneously mitigate the aforementioned two issues. To this end, we propose a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity from 2D images. It also structures Gaussians during learning so that the optimization tends to find better local optima. Our 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our methods over prior works in terms of high-quality reconstruction results, robust generalization, and good efficiency.
中文: GS-RGBN提出了一种混合体素-高斯表示法,通过结合RGB和法线特征来消除几何模糊性并优化高斯结构,从而实现了从单视图图像生成高保真3D物体的突破。
English: GS-RGBN introduces a hybrid Voxel-Gaussian representation that mitigates geometric ambiguity and structures Gaussians to enable high-fidelity 3D object generation from single-view images.

Authors:Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Title: Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval
Abstract:
Traditional sparse and dense retrieval methods struggle to leverage general world knowledge and often fail to capture the nuanced features of queries and products. With the advent of large language models (LLMs), industrial search systems have started to employ LLMs to generate identifiers for product retrieval. Commonly used identifiers include (1) static/semantic IDs and (2) product term sets. The first approach requires creating a product ID system from scratch, missing out on the world knowledge embedded within LLMs. While the second approach leverages this general knowledge, the significant difference in word distribution between queries and products means that product-based identifiers often do not align well with user search queries, leading to missed product recalls. Furthermore, when queries contain numerous attributes, these algorithms generate a large number of identifiers, making it difficult to assess their quality, which results in low overall recall efficiency. To address these challenges, this paper introduces a novel e-commerce retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM employs joint training on text information from both queries and products to generate shared text identifier codes, effectively bridging the gap between queries and products. This approach not only enhances the connection between queries and products but also improves inference efficiency. The model uses a co-alignment strategy to generate codes optimized for maximizing retrieval efficiency. Additionally, it introduces a query-product scoring mechanism to compare product values across different codes, further boosting retrieval efficiency. Extensive offline and online A/B testing demonstrates that GRAM significantly outperforms traditional models and the latest generative retrieval models, confirming its effectiveness and practicality.
English Summary: This paper introduces the Generative Retrieval and Alignment Model (GRAM), a novel e-commerce retrieval paradigm that generates shared text identifier codes through joint training on query and product text, effectively bridging their gap while improving retrieval efficiency and outperforming traditional and modern generative models.

Authors:Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, Jun Liu
Title: POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Abstract:
Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is https://lanyunzhu.site/POPEN/
中文: 本文提出POPEN框架,通过基于偏好的优化和集成方法,有效解决了现有LVLM推理分割中的不精确分割和文本幻觉问题,实现了最低的文本错误率和最高的分割精度,达到当前最优性能。
English: This paper introduces POPEN, a novel framework that addresses imprecise segmentation and hallucinations in LVLM-based reasoning segmentation by incorporating preference-based optimization and ensemble methods, achieving state-of-the-art performance with minimal text errors and the highest segmentation accuracy.

Authors:Zherui Zhang, Changwei Wang, Rongtao Xu, Wenhao Xu, Shibiao Xu, Yu Zhang, Li Guo
Title: CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation
Abstract:
Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from a given pre-trained teacher network to the target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses, at the embedding level, the limitations of previous methods that rely on image-level techniques to improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: (i) significant efficiency advantages resulting from altering the generator training paradigm; (ii) competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; (iii) remarkable transferability of data-free learned representations demonstrated in downstream tasks.
中文摘要:CAE-DFKD通过引入类别感知嵌入方法,有效解决了传统图像级无数据知识蒸馏的局限性,在保持高效训练的同时实现了优越的识别性能和显著提升的特征迁移能力。
English Summary: CAE-DFKD introduces category-aware embedding to overcome limitations of image-level DFKD methods, demonstrating superior efficiency, competitive recognition performance, and enhanced transferability in downstream tasks.

Authors:Maximilian Egger, Rüdiger Urbanke, Rawad Bitar
Title: Federated One-Shot Learning with Data Privacy and Objective-Hiding
Abstract:
Privacy in federated learning is crucial, encompassing two key aspects: safeguarding the privacy of clients' data and maintaining the privacy of the federator's objective from the clients. While the first aspect has been extensively studied, the second has received much less attention. We present a novel approach that addresses both concerns simultaneously, drawing inspiration from techniques in knowledge distillation and private information retrieval to provide strong information-theoretic privacy guarantees. Traditional private function computation methods could be used here; however, they are typically limited to linear or polynomial functions. To overcome these constraints, our approach unfolds in three stages. In stage 0, clients perform the necessary computations locally. In stage 1, these results are shared among the clients, and in stage 2, the federator retrieves its desired objective without compromising the privacy of the clients' data. The crux of the method is a carefully designed protocol that combines secret-sharing-based multi-party computation and a graph-based private information retrieval scheme. We show that our method outperforms existing tools from the literature when properly adapted to this setting.
Chinese: 本文提出了一种新颖的三阶段联邦学习方法,通过结合知识蒸馏和私有信息检索技术,在保护客户端数据隐私的同时隐藏聚合器的目标,其性能优于现有方法。
English: This paper introduces a novel three-stage federated learning approach that simultaneously protects client data privacy and conceals the federator's objective by integrating knowledge distillation with private information retrieval, demonstrating superior performance over existing methods.

Authors:Yuwei Jin, Zichang He, Tianyi Hao, David Amaro, Swamit Tannu, Ruslan Shaydulin, Marco Pistoia
Title: Iceberg Beyond the Tip: Co-Compilation of a Quantum Error Detection Code and a Quantum Algorithm
Abstract:
The rapid progress in quantum hardware is expected to make it a viable tool for the study of quantum algorithms in the near term. The timeline to useful algorithmic experimentation can be accelerated by techniques that use many noisy shots to produce an accurate estimate of the observable of interest. One such technique is to encode the quantum circuit using an error detection code and discard the samples for which an error has been detected. An underexplored property of error-detecting codes is the flexibility in the circuit encoding and fault-tolerant gadgets, which enables their co-optimization with the algorithmic circuit. However, standard circuit optimization tools cannot be used to exploit this flexibility as optimization must preserve the fault-tolerance of the gadget. In this work, we focus on the $[[k+2, k, 2]]$ Iceberg quantum error detection code, which is tailored to trapped-ion quantum processors. We design new flexible fault-tolerant gadgets for the Iceberg code, which we then co-optimize with the algorithmic circuit for the quantum approximate optimization algorithm (QAOA) using tree search. By co-optimizing the QAOA circuit and the Iceberg gadgets, we achieve an improvement in QAOA success probability from $44\%$ to $65\%$ and an increase in post-selection rate from $4\%$ to $33\%$ at 22 algorithmic qubits, utilizing 330 algorithmic two-qubit gates and 744 physical two-qubit gates on the Quantinuum H2-1 quantum computer, compared to the previous state-of-the-art hardware demonstration. Furthermore, we demonstrate better-than-unencoded performance for up to 34 algorithmic qubits, employing 510 algorithmic two-qubit gates and 1140 physical two-qubit gates.
中文: 本研究为冰山量子纠错码设计了灵活容错组件,通过与量子近似优化算法电路协同优化,在离子阱量子处理器上显著提高了成功概率和后选择率。
English: This study introduces flexible fault-tolerant gadgets for the Iceberg quantum error detection code, co-optimizing them with QAOA circuits to significantly enhance success probability and post-selection rates on trapped-ion quantum processors.

Authors:Jiaxin Hong, Sixu Chen, Shuoyang Sun, Hongyao Yu, Hao Fang, Yuqi Tan, Bin Chen, Shuhan Qi, Jiawei Li
Title: GaussTrap: Stealthy Poisoning Attacks on 3D Gaussian Splatting for Targeted Scene Confusion
Abstract:
As 3D Gaussian Splatting (3DGS) emerges as a breakthrough in scene representation and novel view synthesis, its rapid adoption in safety-critical domains (e.g., autonomous systems, AR/VR) urgently demands scrutiny of potential security vulnerabilities. This paper presents the first systematic study of backdoor threats in 3DGS pipelines. We identify that adversaries may implant backdoor views to induce malicious scene confusion during inference, potentially leading to environmental misperception in autonomous navigation or spatial distortion in immersive environments. To uncover this risk, we propose GaussTrap, a novel poisoning attack method targeting 3DGS models. GaussTrap injects malicious views at specific attack viewpoints while preserving high-quality rendering in non-target views, ensuring minimal detectability and maximizing potential harm. Specifically, the proposed method consists of a three-stage pipeline (attack, stabilization, and normal training) to implant stealthy, viewpoint-consistent poisoned renderings in 3DGS, jointly optimizing attack efficacy and perceptual realism to expose security risks in 3D rendering. Extensive experiments on both synthetic and real-world datasets demonstrate that GaussTrap can effectively embed imperceptible yet harmful backdoor views while maintaining high-quality rendering in normal views, validating its robustness, adaptability, and practical applicability.
中文: 本研究首次系统性地探讨了3D高斯溅射中的后门攻击威胁,提出GaussTrap这一隐蔽投毒方法,能在保持正常视角高质量渲染的同时植入恶意视图以引发场景混淆。
English: This study introduces the first systematic investigation of backdoor attacks in 3D Gaussian Splatting pipelines, proposing GaussTrap, a stealthy poisoning method that implants malicious views to induce scene confusion while maintaining high-quality rendering in normal views.

Authors:Linjuan Wu, Haoran Wei, Huan Lin, Tianhao Li, Baosong Yang, Fei Huang, Weiming Lu
Title: Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training
Abstract:
Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
中文: 跨语言上下文预训练(CrossIC-PT)通过利用语义相关的双语文本进行简单下一词预测,有效提升了大型语言模型的多语言性能,在多个模型和语言上均实现了显著准确率提升。
English: Cross-lingual In-context Pre-training (CrossIC-PT) enhances multilingual performance in LLMs by leveraging semantically related bilingual texts through next-word prediction, achieving notable accuracy improvements across multiple models and languages.

Authors:Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia
Title: VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Abstract:
Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85\% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.
中文: 大型视觉语言模型在人工智能应用中至关重要,但因其基于令牌的图像处理方式而效率低下;提出的VCM框架通过自监督视觉概念建模解决了这一问题,在保持多任务性能的同时显著降低了计算成本。
English: Large Vision-Language Models (LVLMs) are essential for AI applications but inefficient due to token-level image processing; the proposed VCM framework addresses this by enabling self-supervised visual concept modeling, drastically cutting computational costs while preserving performance across tasks.

Authors:Taoyu Su, Jiawei Sheng, Duohe Ma, Xiaodong Li, Juwei Yue, Mengxiao Song, Yingkai Tang, Tingwen Liu
Title: Mitigating Modality Bias in Multi-modal Entity Alignment from a Causal Perspective
Abstract:
Multi-Modal Entity Alignment (MMEA) aims to retrieve equivalent entities from different Multi-Modal Knowledge Graphs (MMKGs), a critical information retrieval task. Existing studies have explored various fusion paradigms and consistency constraints to improve the alignment of equivalent entities, while overlooking that the visual modality may not always contribute positively. Empirically, entities with low-similarity images usually generate unsatisfactory performance, highlighting the limitation of overly relying on visual features. We believe the model can be biased toward the visual modality, leading to a shortcut image-matching task. To address this, we propose a counterfactual debiasing framework for MMEA, termed CDMEA, which investigates visual modality bias from a causal perspective. Our approach aims to leverage both visual and graph modalities to enhance MMEA while suppressing the direct causal effect of the visual modality on model predictions. By estimating the Total Effect (TE) of both modalities and excluding the Natural Direct Effect (NDE) of the visual modality, we ensure that the model predicts based on the Total Indirect Effect (TIE), effectively utilizing both modalities and reducing visual modality bias. Extensive experiments on 9 benchmark datasets show that CDMEA outperforms 14 state-of-the-art methods, especially in low-similarity, high-noise, and low-resource data scenarios.
中文:提出的CDMEA框架通过反事实分析抑制视觉模态的直接因果影响,同时有效利用视觉和结构信息来解决多模态实体对齐中的视觉偏差问题,在多种数据集上实现了优越性能。
English: The proposed CDMEA framework addresses visual modality bias in Multi-Modal Entity Alignment by employing counterfactual analysis to suppress direct visual effects while effectively leveraging both visual and structural information, achieving superior performance across diverse datasets.
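The relation between the three causal quantities named in the abstract can be sketched numerically. The sketch assumes the usual counterfactual-inference convention, TIE = TE - NDE, with each effect measured against a reference prediction in which both modalities are replaced by uninformative inputs. The abstract does not give CDMEA's exact formulation, so the function name and the toy scores below are purely illustrative.

```python
import numpy as np


def total_indirect_effect(y_factual, y_visual_only, y_reference):
    """TIE = TE - NDE under the assumed convention.

    y_factual:     scores with both the visual and graph modalities present
    y_visual_only: scores with only the visual modality informative
    y_reference:   scores with both modalities replaced by reference inputs
    """
    te = y_factual - y_reference        # Total Effect of both modalities
    nde = y_visual_only - y_reference   # Natural Direct Effect of the visual modality
    return te - nde                     # equals y_factual - y_visual_only


# Toy alignment scores for three candidate entities (hypothetical numbers).
y_factual = np.array([0.9, 0.6, 0.2])
y_visual_only = np.array([0.7, 0.1, 0.1])
y_reference = np.array([0.3, 0.3, 0.3])

tie = total_indirect_effect(y_factual, y_visual_only, y_reference)
best_biased = int(np.argmax(y_factual))  # candidate 0: favored by image similarity
best_debiased = int(np.argmax(tie))      # candidate 1: favored once the direct visual effect is removed
```

In this toy case the ranking changes after debiasing because candidate 0's high score is mostly explained by the visual modality alone.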

Authors:Tongrui Su, Qingbin Li, Shengyu Zhu, Wei Chen, Xueqi Cheng
Title: A Simple DropConnect Approach to Transfer-based Targeted Attack
Abstract:
We study the problem of transfer-based black-box attack, where adversarial samples generated using a single surrogate model are directly applied to target models. Compared with untargeted attacks, existing methods still have lower Attack Success Rates (ASRs) in the targeted setting, i.e., the obtained adversarial examples often overfit the surrogate model but fail to mislead other models. In this paper, we hypothesize that the pixels or features in these adversarial examples collaborate in a highly dependent manner to maximize the success of an adversarial attack on the surrogate model, which we refer to as perturbation co-adaptation. Then, we propose to Mitigate perturbation Co-adaptation by DropConnect (MCD) to enhance transferability, by creating diverse variants of surrogate model at each optimization iteration. We conduct extensive experiments across various CNN- and Transformer-based models to demonstrate the effectiveness of MCD. In the challenging scenario of transferring from a CNN-based model to Transformer-based models, MCD achieves 13% higher average ASRs compared with state-of-the-art baselines. MCD boosts the performance of self-ensemble methods by bringing in more diversification across the variants while reserving sufficient semantic information for each variant. In addition, MCD attains the highest performance gain when scaling the compute of crafting adversarial examples.
中文: 本文提出MCD方法,通过DropConnect缓解对抗样本中的扰动共适应现象,有效提升黑盒迁移攻击的跨模型成功率,在CNN与Transformer模型间实现了13%的平均攻击成功率提升。
English: This paper introduces MCD, a method that mitigates perturbation co-adaptation in transfer-based black-box attacks by using DropConnect to enhance adversarial example transferability, achieving significantly higher attack success rates across diverse models.
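DropConnect, the mechanism MCD builds on, randomly zeroes individual weights rather than activations. The following minimal NumPy sketch shows how a fresh mask per optimization iteration yields a different variant of a surrogate layer each time. It is not the authors' implementation: the single linear layer, the drop probability of 0.2, and the absence of rescaling are all assumptions made for illustration.

```python
import numpy as np


def dropconnect(weights: np.ndarray, drop_prob: float, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `weights` with each entry independently zeroed with probability drop_prob."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_mask = rng.random(weights.shape) >= drop_prob
    return weights * keep_mask


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # hypothetical surrogate layer: 3 inputs, 4 outputs
x = rng.standard_normal(3)        # hypothetical input features

# One masked variant per (simulated) attack iteration.
variants = [dropconnect(W, 0.2, rng) for _ in range(5)]
outputs = [Wv @ x for Wv in variants]
```

In an actual attack loop, the gradient used to update the perturbation would be computed through a freshly masked variant at every step, so the perturbation cannot rely on any fixed set of weight interactions.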

Authors:Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen
Title: Fast Autoregressive Models for Continuous Latent Generation
Abstract:
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.
中文: 快速自回归模型(FAR)采用轻量级快捷头替代掩码自回归模型中的扩散头,在保持图像生成质量竞争力的同时,实现了高效少步采样,推理速度提升2.3倍。
English: The Fast AutoRegressive model (FAR) introduces a lightweight shortcut head to replace the computationally intensive diffusion head in masked autoregressive models, enabling efficient few-step sampling while maintaining competitive image generation quality and achieving 2.3× faster inference.

Authors:Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada
Title: Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis
Abstract:
Pre-trained deep learning models, known as foundation models, have become essential building blocks in machine learning domains such as natural language processing and image domains. This trend has extended to respiratory and heart sound models, which have demonstrated effectiveness as off-the-shelf feature extractors. However, their evaluation benchmarking has been limited, resulting in incompatibility with state-of-the-art (SOTA) performance, thus hindering proof of their effectiveness. This study investigates the practical effectiveness of off-the-shelf audio foundation models by comparing their performance across four respiratory and heart sound tasks with SOTA fine-tuning results. Experiments show that models struggled on two tasks with noisy data but achieved SOTA performance on the other tasks with clean data. Moreover, general-purpose audio models outperformed a respiratory sound model, highlighting their broader applicability. With gained insights and the released code, we contribute to future research on developing and leveraging foundation models for respiratory and heart sounds.
中文: 预训练音频基础模型在呼吸和心音任务中表现不一,在干净数据上达到顶尖性能但在噪声数据上表现不佳,且通用模型比专用模型展现出更广泛的适用性。
English: Pre-trained audio foundation models show varying effectiveness in respiratory and heart sound tasks, achieving state-of-the-art performance on clean data but struggling with noisy data, while general-purpose models demonstrate broader applicability than specialized ones.

Authors:Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang
Title: DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Abstract:
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
中文: 华为针对ICDAR2025竞赛提出的解决方案,采用基于大视觉语言模型的统一训练框架,结合多任务学习与感知思维链技术,并通过贝叶斯解码与后处理优化,实现了同时支持OCR与免OCR的文档图像翻译任务。
English: Huawei's solution for the ICDAR2025 competition introduces a unified training framework using large vision-language models with multi-task learning and perceptual chain-of-thought, enhanced by Bayesian decoding and post-processing to handle both OCR-based and OCR-free document translation tasks.

Authors:Yanan Zhao, Feng Ji, Kai Zhao, Xuhao Li, Qiyu Kang, Wenfei Liang, Yahya Alkhatib, Xingchao Jian, Wee Peng Tay
Title: Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks
Abstract:
Graph Contrastive Learning (GCL) has recently made progress as an unsupervised graph representation learning paradigm. GCL approaches can be categorized into augmentation-based and augmentation-free methods. The former relies on complex data augmentations, while the latter depends on encoders that can generate distinct views of the same input. Both approaches may require negative samples for training. In this paper, we introduce a novel augmentation-free GCL framework based on graph neural diffusion models. Specifically, we utilize learnable encoders governed by Fractional Differential Equations (FDE). Each FDE is characterized by an order parameter of the differential operator. We demonstrate that varying these parameters allows us to produce learnable encoders that generate diverse views, capturing either local or global information, for contrastive learning. Our model does not require negative samples for training and is applicable to both homophilic and heterophilic datasets. We demonstrate its effectiveness across various datasets, achieving state-of-the-art performance.
中文: 本文提出了一种基于图神经扩散模型的新型免增强图对比学习框架,利用分数阶微分方程生成多样化视图,无需负样本训练,并在同质性与异质性数据集上均实现了最优性能。
English: This paper introduces a novel augmentation-free graph contrastive learning framework using graph neural diffusion models with Fractional Differential Equations, which generates diverse views without negative samples and achieves state-of-the-art performance on both homophilic and heterophilic datasets.

Authors:Xuchuang Wang, Qirun Zeng, Jinhang Zuo, Xutong Liu, Mohammad Hajiesmaili, John C. S. Lui, Adam Wierman
Title: Fusing Reward and Dueling Feedback in Stochastic Bandits
Abstract:
This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
中文: 本文提出了两种融合绝对奖励和相对对决反馈的随机赌博机算法,其中分解融合方法通过自适应选择更有效的反馈类型,实现了接近理论下界的遗憾性能。
English: This paper introduces two fusion algorithms that combine absolute reward and relative dueling feedback in stochastic bandits, with the decomposition approach achieving near-optimal regret by adaptively selecting the more effective feedback type in each round.

Authors:Dengyang Jiang, Zanyi Wang, Hengzhuang Li, Sizhe Dang, Teli Ma, Wei Wei, Guang Dai, Lei Zhang, Mengmeng Wang
Title: AffordanceSAM: Segment Anything Once More in Affordance Grounding
Abstract:
Building a generalized affordance grounding model to identify actionable regions on objects is vital for real-world applications. Existing training methods are either weakly or fully supervised. However, the former requires a complex training framework design and cannot infer new actions without an auxiliary prior, while the latter, despite being simpler, often struggles with limited annotated data and components trained from scratch. This study focuses on fully supervised affordance grounding and overcomes its limitations by proposing AffordanceSAM, which extends SAM's generalization capacity in segmentation to affordance grounding. Specifically, we design an affordance-adaption module and curate a coarse-to-fine annotated dataset called C2F-Aff to thoroughly transfer SAM's robust performance to affordance in a three-stage training manner. Experimental results confirm that AffordanceSAM achieves state-of-the-art (SOTA) performance on the AGD20K benchmark and exhibits strong generalized capacity.
中文: 本研究提出AffordanceSAM模型,通过全监督方式将SAM的分割泛化能力扩展至可供性接地任务,利用三阶段训练策略克服数据限制,在AGD20K基准测试中实现了最优性能。
English: This study introduces AffordanceSAM, a fully supervised model that extends SAM's segmentation capabilities to affordance grounding, overcoming data limitations and achieving state-of-the-art performance on the AGD20K benchmark through a three-stage training approach.

Authors:Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu
Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Abstract:
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
中文: Eagle 2.5推出了前沿视觉语言模型系列,通过自动降级采样和图像区域保护等创新技术,在长上下文多模态基准测试中实现突破性进展,其80亿参数模型在Video-MME上获得72.4%得分,性能媲美顶级商业模型。
English: Eagle 2.5 introduces a family of vision-language models with innovative techniques like Automatic Degrade Sampling and Image Area Preservation, achieving state-of-the-art performance in long-context multimodal benchmarks and matching top-tier models with its 8B variant scoring 72.4% on Video-MME.

Authors:Hongli Peng, Xiaoqi Li, Wenkai Li
Title: Mining Characteristics of Vulnerable Smart Contracts Across Lifecycle Stages
Abstract:
Smart contracts are the cornerstone of decentralized applications and financial protocols, which extend the application of digital currency transactions. These applications and financial protocols introduce significant security challenges, resulting in substantial economic losses. Existing solutions predominantly focus on code vulnerabilities within smart contracts, which account for only 50% of security incidents. Therefore, a more comprehensive study of security issues related to smart contracts is imperative. Existing empirical research performs static analysis of smart contracts from a lifecycle perspective and proposes corresponding measures for each stage. However, it lacks a characteristic analysis of the vulnerabilities at each stage and does not distinguish between them. In this paper, we present the first empirical study on the security of smart contracts throughout their lifecycle, including deployment and execution, upgrade, and destruction stages. It delves into the security issues at each stage and provides at least seven feature descriptions. Finally, utilizing these seven features, five machine-learning classification models are used to identify vulnerabilities at different stages. The classification results reveal that vulnerable contracts exhibit distinct transaction features and ego network properties at various stages.
中文: 本文首次对智能合约全生命周期安全进行实证研究,通过识别七个漏洞特征并应用机器学习模型,揭示了不同阶段漏洞具有独特的交易模式和网络属性。
English: This paper presents the first empirical study analyzing smart contract security across their entire lifecycle, identifying seven distinct vulnerability features and using machine learning models to demonstrate that vulnerabilities manifest unique transaction and network characteristics at different stages.

Authors:Tingyang Chen, Cong Fu, Xiangyu Ke, Yunjun Gao, Yabo Ni, Anxiang Zeng
Title: Stitching Inner Product and Euclidean Metrics for Topology-aware Maximum Inner Product Search
Abstract:
Maximum Inner Product Search (MIPS) is a fundamental challenge in machine learning and information retrieval, particularly in high-dimensional data applications. Existing approaches to MIPS either rely solely on Inner Product (IP) similarity, which faces issues with local optima and redundant computations, or reduce the MIPS problem to the Nearest Neighbor Search under the Euclidean metric via space projection, leading to topology destruction and information loss. Despite the divergence of the two paradigms, we argue that there is no inherent binary opposition between IP and Euclidean metrics. By stitching IP and Euclidean in the design of indexing and search algorithms, we can significantly enhance MIPS performance. Specifically, this paper explores the theoretical and empirical connections between these two metrics from the MIPS perspective. Our investigation, grounded in graph-based search, reveals that different indexing and search strategies offer distinct advantages for MIPS, depending on the underlying data topology. Building on these insights, we introduce a novel graph-based index called Metric-Amphibious Graph (MAG) and a corresponding search algorithm, Adaptive Navigation with Metric Switch (ANMS). To facilitate parameter tuning for optimal performance, we identify three statistical indicators that capture essential data topology properties and correlate strongly with parameter tuning. Extensive experiments on 12 real-world datasets demonstrate that MAG outperforms existing state-of-the-art methods, achieving up to 4x search speedup while maintaining adaptability and scalability.
中文: 本文提出了一种新颖的图索引结构和搜索算法,通过融合内积与欧氏距离度量,在保持适应性的同时显著提升了最大内积搜索性能,实现了高达4倍的搜索加速。
English: This paper introduces a novel graph-based index and search algorithm that integrates both Inner Product and Euclidean metrics to significantly enhance Maximum Inner Product Search performance, achieving up to 4x speedup while maintaining adaptability across diverse data topologies.
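The "reduce MIPS to nearest neighbor search via space projection" paradigm mentioned in the abstract can be made concrete with one well-known transformation: append an extra coordinate to every item so that all items share the same norm, and append a zero to the query. This is a generic illustration of the reduction, not the MAG/ANMS method, and the abstract does not say which projection the prior work it criticizes uses. The helper names are hypothetical.

```python
import numpy as np


def augment_items(items: np.ndarray) -> np.ndarray:
    """Append sqrt(M^2 - ||x||^2) to each item, where M is the largest item norm."""
    norms_sq = np.sum(items ** 2, axis=1)
    max_sq = norms_sq.max()
    extra = np.sqrt(np.maximum(max_sq - norms_sq, 0.0))
    return np.hstack([items, extra[:, None]])


def augment_query(query: np.ndarray) -> np.ndarray:
    """Append a zero so the query lives in the same augmented space."""
    return np.append(query, 0.0)


rng = np.random.default_rng(0)
items = rng.standard_normal((200, 16))
query = rng.standard_normal(16)

# Brute-force MIPS in the original space.
mips_idx = int(np.argmax(items @ query))

# Brute-force Euclidean nearest neighbor in the augmented space.
aug_items = augment_items(items)
aug_query = augment_query(query)
nn_idx = int(np.argmin(np.linalg.norm(aug_items - aug_query, axis=1)))
```

The two indices agree because the squared distance in the augmented space is M^2 + ||q||^2 - 2 x·q, which is minimized exactly when the inner product x·q is maximized. The added coordinate, however, changes the neighborhood structure among the items themselves, which illustrates the kind of topology distortion the abstract attributes to projection-based approaches.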

Authors:Kangwei Xu, Bing Li, Grace Li Zhang, Ulf Schlichtmann
Title: HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis
Abstract:
In high-level synthesis (HLS), C/C++ programs with synthesis directives are used to generate circuits for FPGA implementations. However, hardware-specific and platform-dependent characteristics in these implementations can introduce behavioral discrepancies between the original C/C++ programs and the circuits after high-level synthesis. Existing methods for testing behavioral discrepancies in HLS are still immature, and the testing workflow requires significant human effort. To address this challenge, we propose HLSTester, a large language model (LLM) aided testing framework that efficiently detects behavioral discrepancies in HLS. To mitigate hallucinations in LLMs and enhance prompt quality, the testbenches for original C/C++ programs are leveraged to guide LLMs in generating HLS-compatible testbenches, effectively eliminating certain traditional C/C++ constructs that are incompatible with HLS tools. Key variables are pinpointed through a backward slicing technique in both C/C++ and HLS programs to monitor their runtime spectra, enabling an in-depth analysis of the discrepancy symptoms. To reduce test time, a testing input generation mechanism is introduced to integrate dynamic mutation with insights from an LLM-based progressive reasoning chain. In addition, repetitive hardware testing is skipped by a redundancy-aware filtering technique for the generated test inputs. Experimental results demonstrate that the proposed LLM-aided testing framework significantly accelerates the testing workflow while achieving higher testbench simulation pass rates compared with the traditional method and the direct use of LLMs on the same HLS programs.
中文: HLSTester是一种基于大语言模型的测试框架,通过生成兼容的测试平台和优化测试输入,有效检测高层次综合中的行为差异,显著加速测试流程并提高仿真通过率。
English: HLSTester is an LLM-aided testing framework that efficiently detects behavioral discrepancies in high-level synthesis by generating compatible testbenches and optimizing test input generation, significantly accelerating the testing workflow with higher simulation pass rates.

Authors:Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, Xueqi Cheng
Title: a1: Steep Test-time Scaling Law via Environment Augmented Generation
Abstract:
Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.
中文: 环境增强生成(EAG)框架通过整合实时环境反馈、动态分支探索和基于经验的学习,有效解决大语言模型在复杂任务中的幻觉和逻辑错误问题,实现了顶尖性能,其独特扩展模式随任务复杂性提升而优势愈加显著。
English: The proposed Environment Augmented Generation (EAG) framework enhances LLM reasoning by integrating real-time environmental feedback, dynamic branch exploration, and experience-based learning to address hallucinations and logical errors in complex tasks, achieving state-of-the-art performance with a distinctive scaling pattern that improves with task complexity.

Authors:Weijun Zhuang, Qizhang Li, Xin Li, Ming Liu, Xiaopeng Hong, Feng Gao, Fan Yang, Wangmeng Zuo
Title: Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection
Abstract:
Temporal Action Detection and Moment Retrieval constitute two pivotal tasks in video understanding, focusing on precisely localizing temporal segments corresponding to specific actions or events. Recent advancements introduced Moment Detection to unify these two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to facilitate comprehensive video-text alignment and enable effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions. Comprehensive evaluations across four benchmark datasets including ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA demonstrate that Grounding-MD establishes new state-of-the-art performance in zero-shot and supervised settings in open-world moment detection scenarios. All source code and trained models will be released.
中文: Grounding-MD是一种创新的视频语言预训练框架,专为开放世界时刻检测设计,通过跨模态融合统一时序动作检测与时刻检索任务,并在多个基准测试中实现了最先进的性能。
English: Grounding-MD is a novel video-language pre-training framework designed for open-world moment detection, which unifies temporal action detection and moment retrieval through cross-modality fusion and achieves state-of-the-art performance across multiple benchmarks.

Authors:Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu
Title: CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
Abstract:
Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains insufficiently explored. We present CodeCrash, a comprehensive stress-testing benchmark comprising 1,279 questions from two established datasets, CruxEval and LiveCodeBench, designed to evaluate model reasoning reliability under non-standard coding environments. We systematically evaluate 17 LLMs across input and output prediction tasks using direct and Chain-of-Thought prompting approaches, revealing that LLMs are particularly vulnerable to disorganized code and overly reliant on natural language cues: aggregated structural perturbations result in over 14 percentage points (pp) of degradation, while textual perturbations cause a performance drop of over 11 pp. Moreover, self-reflective mechanisms in state-of-the-art reasoning models significantly increase token usage by 2-3 times, reduce output confidence, and even lead to catastrophic reasoning failures when faced with targeted perturbations -- for instance, QwQ-32B generates over 12,000 redundant tokens under reasoning-level perturbations. CodeCrash provides a rigorous benchmark for evaluating robustness in code understanding, guiding future research toward more reliable and resilient LLMs in code reasoning. The benchmark code, perturbed datasets, and full leaderboard are publicly available at https://cuhk-arise.github.io/CodeCrash/ .
中文: CodeCrash基准测试评估了17个大语言模型在代码推理中的鲁棒性,发现它们对结构性和文本性干扰极为敏感,性能显著下降,并为开发更可靠的模型提供了资源。
English: CodeCrash is a benchmark that tests the robustness of 17 large language models in code reasoning, revealing their vulnerabilities to structural and textual perturbations and significant performance drops, while providing a resource for developing more reliable models.

Authors:Chenxuan Liu, He Li, Zongze Li, Shuai Wang, Wei Xu, Kejiang Ye, Derrick Wing Kwan Ng, Chengzhong Xu
Title: Green Robotic Mixed Reality with Gaussian Splatting
Abstract:
Realizing green communication in robotic mixed reality (RoboMR) systems presents a challenge, due to the necessity of uploading high-resolution images at high frequencies through wireless channels. This paper proposes Gaussian splatting (GS) RoboMR (GSRMR), which achieves a lower energy consumption and makes a concrete step towards green RoboMR. The crux to GSRMR is to build a GS model which enables the simulator to opportunistically render a photo-realistic view from the robot's pose, thereby reducing the need for excessive image uploads. Since the GS model may involve discrepancies compared to the actual environments, a GS cross-layer optimization (GSCLO) framework is further proposed, which jointly optimizes content switching (i.e., deciding whether to upload image or not) and power allocation across different frames. The GSCLO problem is solved by an accelerated penalty optimization (APO) algorithm. Experiments demonstrate that the proposed GSRMR reduces the communication energy by over 10x compared with RoboMR. Furthermore, the proposed GSRMR with APO outperforms extensive baseline schemes, in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).
中文: 本文提出GS RoboMR (GSRMR)方法,利用高斯泼溅技术减少机器人混合现实中高频图像上传,通过跨层优化框架实现能效提升,通信能耗降低超过10倍。
English: This paper introduces GS RoboMR (GSRMR), a method that uses Gaussian splatting to reduce energy consumption in robotic mixed reality systems by minimizing high-frequency image uploads, supported by a cross-layer optimization framework and achieving over 10x energy savings.

Authors:Hanyu Zhang, Zhen Xing, Wenxuan Yang, Chenxi Ma, Weimin Tan, Bo Yan
Title: Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning
Abstract:
As transfer learning models and datasets grow larger, efficient adaptation and storage optimization have become critical needs. Coreset selection addresses these challenges by identifying and retaining the most informative samples, constructing a compact subset for target domain training. However, current methods primarily rely on instance-level difficulty assessments, overlooking crucial category-level characteristics and consequently under-representing minority classes. To overcome this limitation, we propose Non-Uniform Class-Wise Coreset Selection (NUCS), a novel framework that integrates both class-level and instance-level criteria. NUCS automatically allocates data selection budgets for each class based on intrinsic category difficulty and adaptively selects samples within optimal difficulty ranges. By explicitly incorporating category-specific insights, our approach achieves a more balanced and representative coreset, addressing key shortcomings of prior methods. Comprehensive theoretical analysis validates the rationale behind adaptive budget allocation and sample selection, while extensive experiments across 14 diverse datasets and model architectures demonstrate NUCS's consistent improvements over state-of-the-art methods, achieving superior accuracy and computational efficiency. Notably, on CIFAR100 and Food101, NUCS matches full-data training accuracy while retaining just 30% of samples and reducing computation time by 60%. Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
中文摘要:提出的非均匀类感知核心集选择(NUCS)框架通过整合类别级和实例级标准,解决了现有方法忽视类别特征的问题,在仅使用30%样本的情况下实现了与全数据相当的准确率,并在多个数据集上验证了其优越性能。
English Summary: The proposed Non-Uniform Class-Wise Coreset Selection (NUCS) framework overcomes limitations of existing methods by integrating class-level and instance-level criteria to create balanced coresets, matching full-data training accuracy on CIFAR100 and Food101 with only 30% of samples and 60% less computation time.

Authors:Wenxuan Yang, Qingqu Wei, Chenxi Ma, Weimin Tan, Bo Yan
Title: Scaling Laws for Data-Efficient Visual Transfer Learning
Abstract:
Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K-1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distillation counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrate these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, addressing a critical barrier to understanding vision model scaling behaviors and optimizing computational resource allocation.
中文摘要:本文首次提出了视觉迁移学习中数据高效扩展规律的实用框架,通过蒸馏边界理论揭示了关键转折点:在数据稀缺时蒸馏模型显著优于非蒸馏模型,而当预训练数据超过临界阈值后,非蒸馏模型反而表现更优。
English Summary: This paper introduces the first practical framework for data-efficient scaling laws in visual transfer learning, establishing the distillation boundary theory which reveals a critical inflection point where distilled models outperform non-distilled counterparts in data-scarce conditions but are surpassed when sufficient task-specific data becomes available.

Authors:João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O'Donnell
Title: Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
Abstract:
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
中文摘要:本研究提出了一种基于序列蒙特卡洛的受控文本生成架构,使语言模型能够有效融入领域特定约束,在多个挑战性领域中展现出优于大型模型的性能表现。
English Summary: This research introduces a sequential Monte Carlo (SMC) architecture for controlled text generation that enables language models to efficiently incorporate domain-specific constraints, demonstrating superior performance over larger models across multiple challenging domains.
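
The propose, reweight, and resample loop that SMC-based constrained generation relies on can be sketched on a toy problem. This is a minimal illustration, not the authors' system: the uniform `lm_next` stand-in, the "never contain `bb`" constraint, and the function names are all hypothetical, and resampling is done at every step for simplicity.

```python
import random

VOCAB = ["a", "b", "c"]

def lm_next(prefix):
    """Toy stand-in for a language model (hypothetical): uniform next-token distribution."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def potential(prefix):
    """Constraint expressed as a 0/1 potential: the string must never contain 'bb'."""
    return 0.0 if "bb" in prefix else 1.0

def smc_generate(num_particles, length, seed=0):
    """Return num_particles strings of the given length that satisfy the constraint."""
    rng = random.Random(seed)
    particles = [""] * num_particles
    weights = [1.0] * num_particles
    for _ in range(length):
        for i, prefix in enumerate(particles):
            # Propose: extend the particle with a token sampled from the LM.
            dist = lm_next(prefix)
            tok = rng.choices(list(dist), weights=list(dist.values()))[0]
            particles[i] = prefix + tok
            # Reweight: multiply by the potential of the extended prefix.
            weights[i] *= potential(particles[i])
        if sum(weights) == 0.0:
            raise RuntimeError("all particles violated the constraint")
        # Resample: reallocate computation toward particles that are still valid.
        particles = rng.choices(particles, weights=weights, k=num_particles)
        weights = [1.0] * num_particles
    return particles
```

Calling `smc_generate(50, 10)` returns 50 strings of length 10. Because violating particles receive zero weight before resampling, none of the returned strings contains `bb`.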

Authors:Yundi Zhang, Paul Hager, Che Liu, Suprosanna Shit, Chen Chen, Daniel Rueckert, Jiazhen Pan
Title: Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond
Abstract:
Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographics, metabolic, and lifestyle, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual's disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework. Recent multi-modal approaches have begun to bridge this gap, yet they often rely on limited spatio-temporal data and focus on isolated clinical tasks, thereby hindering the development of a comprehensive representation for cardiac health evaluation. To overcome these limitations, we introduce ViTa, a step toward foundation models that delivers a comprehensive representation of the heart and a precise interpretation of individual disease risk. Leveraging data from 42,000 UK Biobank participants, ViTa integrates 3D+T cine stacks from short-axis and long-axis views, enabling a complete capture of the cardiac cycle. These imaging data are then fused with detailed tabular patient-level factors, enabling context-aware insights. This multi-modal paradigm supports a wide spectrum of downstream tasks, including cardiac phenotype and physiological feature prediction, segmentation, and classification of cardiac and metabolic diseases within a single unified framework. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional, task-specific models toward a universal, patient-specific understanding of cardiac health, highlighting its potential to advance clinical utility and scalability in cardiac analysis.
中文: ViTa作为一种多模态基础模型,将心脏磁共振成像与患者健康数据相结合,构建了全面的心脏健康表征,通过统一框架支持多种临床任务应用。
English: ViTa is a multi-modal foundation model that integrates cardiac MRI with patient-level health factors to create a comprehensive representation of cardiac health, enabling diverse clinical tasks through a unified framework.

Authors:Yuyang Li, Wenxin Du, Chang Yu, Puhao Li, Zihang Zhao, Tengyu Liu, Chenfanfu Jiang, Yixin Zhu, Siyuan Huang
Title: Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation
Abstract:
Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. As a promising solution, Vision-Based Tactile Sensors (VBTSs) offer high spatial resolution and cost-effectiveness, but present unique challenges in robotics for their complex physical characteristics and visual signal processing requirements. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. We present Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real-time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development, potentially transforming how robots interact with and understand their physical environment.
中文摘要:Taccel作为高性能仿真平台,能快速精确模拟视觉触觉传感器,在数千个并行环境中达到18倍于实时的仿真速度,并实现成功的模拟到现实迁移。
English Summary: Taccel is a high-performance simulation platform that enables fast and accurate modeling of vision-based tactile sensors, running 18 times faster than real time across thousands of parallel environments and supporting successful sim-to-real transfer.

Authors:Mingwei Liu, Juntao Li, Ying Wang, Xueying Du, Zuoyu Ou, Qiuyuan Chen, Bingxu An, Zhao Wei, Yong Xu, Fangming Zou, Xin Peng, Yiling Lou
Title: Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation
Abstract:
Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model's tendency to generate structurally redundant code, resulting in inefficiencies and reduced readability. To address this, we conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs using three widely-used benchmarks. Our study includes both quantitative and qualitative analyses, revealing that repetition is pervasive and manifests at various granularities and extents, including character, statement, and block levels. We further summarize a taxonomy of 20 repetition patterns. Building on our findings, we propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code. We evaluate DeRep using both open-source benchmarks and in an industrial setting. Our results demonstrate that DeRep significantly outperforms baselines in reducing repetition (with average improvements of 91.3%, 93.5%, and 79.9% in rep-3, rep-line, and sim-line metrics) and enhancing code quality (with a Pass@1 increase of 208.3% over greedy search). Furthermore, integrating DeRep improves the performance of existing repetition mitigation methods, with Pass@1 improvements ranging from 53.7% to 215.7%.
中文摘要:本研究揭示了大型语言模型生成代码中普遍存在的重复问题,并提出基于规则的DeRep方法,该方法在多个基准测试中显著减少了代码重复并提升了代码质量。
English Summary: This study identifies pervasive code repetition as a major issue in LLM-generated code and introduces DeRep, a rule-based technique that significantly reduces repetition and improves code quality across multiple benchmarks.

Authors:Benjamin Krummenacher, Jonas Frey, Turcan Tuna, Olga Vysotska, Marco Hutter
Title: Diffusion Based Robust LiDAR Place Recognition
Abstract:
Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts within and across floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution of potential global positions can provide a multi-modal position distribution. We evaluate our approach across five real-world datasets and show a place recognition accuracy of 77% +/-2m on average while outperforming baselines by a factor of 2 in mean error.
中文摘要:本文提出了一种基于激光雷达的移动机器人在建筑工地的全局重定位方法,通过结合PointNet++的扩散模型从合成点云中预测多个位置候选,实现了77%的定位精度,并将平均误差降低至基线方法的二分之一。
English Summary: This paper presents a LiDAR-based global repositioning method for mobile robots in construction sites, using a diffusion model with PointNet++ to predict multiple position candidates from synthetic point clouds, achieving 77% accuracy and outperforming baselines by a factor of 2 in mean error.

Authors:Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang
Title: Optimizing Compound Retrieval Systems
Abstract:
Modern retrieval systems do not rely on a single ranking model to construct their rankings. Instead, they generally take a cascading approach where a sequence of ranking models are applied in multiple re-ranking stages. Thereby, they balance the quality of the top-K ranking with computational costs by limiting the number of documents each model re-ranks. However, the cascading approach is not the only way models can interact to form a retrieval system. We propose the concept of compound retrieval systems as a broader class of retrieval systems that apply multiple prediction models. This encapsulates cascading models but also allows other types of interactions than top-K re-ranking. In particular, we enable interactions with large language models (LLMs) which can provide relative relevance comparisons. We focus on the optimization of compound retrieval system design which uniquely involves learning where to apply the component models and how to aggregate their predictions into a final ranking. This work shows how our compound approach can combine the classic BM25 retrieval model with state-of-the-art (pairwise) LLM relevance predictions, while optimizing a given ranking metric and efficiency target. Our experimental results show optimized compound retrieval systems provide better trade-offs between effectiveness and efficiency than cascading approaches, even when applied in a self-supervised manner. With the introduction of compound retrieval systems, we hope to inspire the information retrieval field to more out-of-the-box thinking on how prediction models can interact to form rankings.
中文摘要:本文提出复合检索系统这一更广泛的检索系统类别,将BM25等传统模型与大型语言模型的(成对)相关性预测相结合,在效果与效率的权衡上优于传统的级联方法。
English Summary: This paper proposes compound retrieval systems, a broader class than cascades that combines multiple prediction models, such as BM25 with pairwise LLM relevance predictions, and shows that optimized compound systems offer better effectiveness-efficiency trade-offs than cascading approaches.

Authors:Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
Title: EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
Abstract:
Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
中文摘要:本研究提出EgoExo-Gen模型,通过两阶段方法利用手物交互掩码和文本指令从第三人称视角生成第一人称视角视频帧,在基准测试中展现出优于现有视频预测模型的性能。
English Summary: This study introduces EgoExo-Gen, a two-stage model that leverages hand-object interaction masks and text instructions to generate future first-person video frames from external-view inputs, demonstrating superior performance on benchmark datasets.

Authors:Xiaohua Feng, Yuyuan Li, Fengyuan Yu, Ke Xiong, Junjie Fang, Li Zhang, Tianyu Du, Chaochao Chen
Title: RAID: An In-Training Defense against Attribute Inference Attacks in Recommender Systems
Abstract:
In various networks and mobile applications, users are highly susceptible to attribute inference attacks, with particularly prevalent occurrences in recommender systems. Attackers exploit partially exposed user profiles in recommendation models, such as user embeddings, to infer private attributes of target users, such as gender and political views. The goal of defenders is to mitigate the effectiveness of these attacks while maintaining recommendation performance. Most existing defense methods, such as differential privacy and attribute unlearning, focus on post-training settings, which limits their capability of utilizing training data to preserve recommendation performance. Although adversarial training extends defenses to in-training settings, it often struggles with convergence due to unstable training processes. In this paper, we propose RAID, an in-training defense method against attribute inference attacks in recommender systems. In addition to the recommendation objective, we define a defensive objective to ensure that the distribution of protected attributes becomes independent of class labels, making users indistinguishable from attribute inference attacks. Specifically, this defensive objective aims to solve a constrained Wasserstein barycenter problem to identify the centroid distribution that makes the attribute indistinguishable while complying with recommendation performance constraints. To optimize our proposed objective, we use optimal transport to align users with the centroid distribution. We conduct extensive experiments on four real-world datasets to evaluate RAID. The experimental results validate the effectiveness of RAID and demonstrate its significant superiority over existing methods in multiple aspects.
中文摘要:本文提出RAID方法,通过在训练过程中利用最优传输将用户分布与质心对齐,使受保护属性无法被推断,同时保持推荐系统的性能。
English Summary: The paper introduces RAID, an in-training defense method that uses optimal transport to align user distributions with a centroid, making attributes indistinguishable to attackers while preserving recommendation performance.
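
The idea of aligning groups with a shared centroid distribution can be illustrated in one dimension, where the Wasserstein-2 barycenter has a closed form: its quantile function is the weighted average of the groups' quantile functions. The sketch below is an unconstrained 1D simplification under stated assumptions (equal-size groups, scalar scores, hypothetical function names). It is not RAID's constrained objective and it ignores the recommendation-performance constraint.

```python
import numpy as np

def barycenter_1d(groups, weights):
    """W2 barycenter of equal-size 1D samples: weighted average of the sorted values."""
    sorted_groups = np.stack([np.sort(g) for g in groups])
    return np.average(sorted_groups, axis=0, weights=weights)

def align_to_barycenter(group, barycenter):
    """Monotone transport map: send each sample to the barycenter value of the same rank."""
    ranks = np.argsort(np.argsort(group))  # rank of each sample within its group
    return barycenter[ranks]

# Two hypothetical attribute groups with clearly different score distributions.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=500)
group_b = rng.normal(3.0, 2.0, size=500)

centroid = barycenter_1d([group_a, group_b], weights=[0.5, 0.5])
aligned_a = align_to_barycenter(group_a, centroid)
aligned_b = align_to_barycenter(group_b, centroid)
```

After alignment both groups have the same empirical distribution, so the attribute can no longer be read off these scores, while the ordering of samples inside each group is preserved.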

Authors:Wenyi Zhang, Ju Jia, Xiaojun Jia, Yihao Huang, Xinfeng Li, Cong Wu, Lina Wang
Title: PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage
Abstract:
Multimodal datasets can be leveraged to pre-train large-scale vision-language models by providing cross-modal semantics. Current endeavors for determining the usage of datasets mainly focus on single-modal dataset ownership verification through intrusive methods and non-intrusive techniques, while cross-modal approaches remain under-explored. Intrusive methods can adapt to multimodal datasets but degrade model accuracy, while non-intrusive methods rely on label-driven decision boundaries that fail to guarantee stable behaviors for verification. To address these issues, we propose a novel prompt-adapted transferable fingerprinting scheme from a training-free perspective, called PATFinger, which incorporates the global optimal perturbation (GOP) and the adaptive prompts to capture dataset-specific distribution characteristics. Our scheme utilizes inherent dataset attributes as fingerprints instead of compelling the model to learn triggers. The GOP is derived from the sample distribution to maximize embedding drifts between different modalities. Subsequently, our PATFinger re-aligns the adaptive prompt with GOP samples to capture the cross-modal interactions on the carefully crafted surrogate model. This allows the dataset owner to check the usage of datasets by observing specific prediction behaviors linked to the PATFinger during retrieval queries. Extensive experiments demonstrate that our scheme improves effectiveness against unauthorized multimodal dataset usage on various cross-modal retrieval architectures by 30% over state-of-the-art baselines.
Chinese: 本文提出PATFinger方案,通过全局最优扰动和自适应提示从训练无关角度实现多模态数据集指纹识别,能有效追踪跨模态交互特征来验证数据集使用权,在跨模态检索架构上的防未经授权使用效果比现有技术提升30%。
English: This paper introduces PATFinger, a training-free fingerprinting scheme that uses global optimal perturbations and adaptive prompts to verify multimodal dataset ownership by capturing cross-modal interactions, achieving 30% higher effectiveness against unauthorized usage than existing methods.

Authors:Yunyang Cao, Juekai Lin, Hongye Wang, Wenhao Li, Bo Jin
Title: Interpretable Hybrid-Rule Temporal Point Processes
Abstract:
Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.
中文: 提出的混合规则时序点过程(HRTPP)通过将时序逻辑规则与数值特征相结合,在医疗事件建模中同时提升了可解释性和预测准确性,经真实数据集验证其优于现有方法。
English: The proposed Hybrid-Rule Temporal Point Process (HRTPP) integrates temporal logic rules with numerical features to enhance both interpretability and predictive accuracy in medical event modeling, outperforming existing methods through a novel framework validated on real-world datasets.

Authors:Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani
Title: LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation
Abstract:
Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training free visual token pruning method specifically designed for LVLM based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
中文: 大型视觉语言模型(LVLMs)能有效引导视觉模型完成推理分割任务,但处理大量图像令牌带来高昂计算成本,而提出的LVLM_CSP方法通过聚类、散布、剪枝三阶段策略,在将图像令牌推理FLOPs减少65-70%的同时几乎保持精度无损。
English: Large Vision Language Models (LVLMs) effectively guide vision models for reasoning segmentation but face high computational costs from processing numerous image tokens, which the proposed training-free LVLM_CSP method addresses by cutting image token inference FLOPs by 65-70% with minimal accuracy loss.

Authors:Tahrim Hossain, Sakib Hassan, Faisal Haque Bappy, Muhammad Nur Yanhaona, Sarker Ahmed Rumee, Moinul Zaber, Tariqul Islam
Title: FlexiContracts: A Novel and Efficient Scheme for Upgrading Smart Contracts in Ethereum Blockchain
Abstract:
Blockchain technology has revolutionized contractual processes, enhancing efficiency and trust through smart contracts. Ethereum, as a pioneer in this domain, offers a platform for decentralized applications but is challenged by the immutability of smart contracts, which makes upgrades cumbersome. Existing design patterns, while addressing upgradability, introduce complexity, increased development effort, and higher gas costs, thus limiting their effectiveness. In response, we introduce FlexiContracts, an innovative scheme that reimagines the evolution of smart contracts on Ethereum. By enabling secure, in-place upgrades without losing historical data, FlexiContracts surpasses existing approaches, introducing a previously unexplored path in smart contract evolution. Its streamlined design transcends the limitations of current design patterns by simplifying smart contract development, eliminating the need for extensive upfront planning, and significantly reducing the complexity of the design process. This advancement fosters an environment for continuous improvement and adaptation to new requirements, redefining the possibilities for dynamic, upgradable smart contracts.
Chinese Summary: FlexiContracts提出了一种创新的方案,在以太坊上实现安全、原地升级智能合约并保留历史数据,克服了现有可升级模式复杂且成本高昂的局限性。
English Summary: FlexiContracts introduces a novel scheme for Ethereum that enables secure, in-place smart contract upgrades while preserving historical data, overcoming the limitations of existing complex and costly upgradability patterns.

Authors:Afra Amini, Tim Vieira, Ryan Cotterell
Title: Better Estimation of the KL Divergence Between Language Models
Abstract:
Estimating the Kullback--Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao--Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao--Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
中文: 本文提出了一种用于语言模型间KL散度的Rao-Blackwellized估计器,相比标准蒙特卡洛方法能降低方差并提供更稳定的估计,同时扩展到梯度估计以提升训练稳定性和模型性能。
English: This paper introduces a Rao-Blackwellized estimator for KL divergence between language models, which reduces variance and provides more stable estimates compared to standard Monte Carlo methods, also extending to gradient estimation for improved training stability and model performance.
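
The contrast between the two estimators can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes two hand-written bigram "language models" (`P` and `Q`, a three-token vocabulary, a fixed length of four, and a hypothetical `START` context), all invented for illustration. The Monte Carlo value sums sampled log-ratios, while the Rao-Blackwellized value replaces each sampled log-ratio with the exact KL between the two next-token distributions given the sampled prefix.

```python
import itertools
import math
import random
import statistics

START = 3   # hypothetical start-of-sequence context index
LENGTH = 4  # fixed sequence length for the toy models

# Toy bigram "language models": row c is the next-token distribution after context c.
P = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.7, 0.2, 0.1]]
Q = [[0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.4, 0.4, 0.2], [0.3, 0.4, 0.3]]


def token_kl(p_row, q_row):
    """Exact KL between two next-token distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_row, q_row))


def sample_estimates(n_samples, seed=0):
    """Draw sequences from P; return per-sample MC and Rao-Blackwellized values."""
    rng = random.Random(seed)
    mc_vals, rb_vals = [], []
    for _ in range(n_samples):
        ctx, mc, rb = START, 0.0, 0.0
        for _ in range(LENGTH):
            p_row, q_row = P[ctx], Q[ctx]
            rb += token_kl(p_row, q_row)  # exact expectation over the next token
            tok = rng.choices(range(3), weights=p_row)[0]
            mc += math.log(p_row[tok] / q_row[tok])  # single-sample log-ratio
            ctx = tok
        mc_vals.append(mc)
        rb_vals.append(rb)
    return mc_vals, rb_vals


def exact_kl():
    """Brute-force KL over all 3**LENGTH sequences (feasible only for toy models)."""
    total = 0.0
    for seq in itertools.product(range(3), repeat=LENGTH):
        ctx, logp, logq = START, 0.0, 0.0
        for tok in seq:
            logp += math.log(P[ctx][tok])
            logq += math.log(Q[ctx][tok])
            ctx = tok
        total += math.exp(logp) * (logp - logq)
    return total


if __name__ == "__main__":
    mc, rb = sample_estimates(5000)
    print("exact KL:", exact_kl())
    print("MC mean / variance:", statistics.mean(mc), statistics.pvariance(mc))
    print("RB mean / variance:", statistics.mean(rb), statistics.pvariance(rb))
```

In this toy setting every Rao-Blackwellized sample is a sum of exact per-step KL terms and therefore cannot be negative, whereas individual Monte Carlo samples can be.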

Authors:Ismail Cosandal, Sennur Ulukus, Nail Akar
Title: Minimizing Functions of Age of Incorrect Information for Remote Estimation
Abstract:
The age of incorrect information (AoII) process, which keeps track of the time elapsed since the source and monitor processes were last in sync, has been extensively used in remote estimation problems. In this paper, we consider a push-based remote estimation system with a discrete-time Markov chain (DTMC) information source that transmits status update packets towards the monitor once the AoII process exceeds a certain estimation-based threshold. The time average of an arbitrary function of AoII is taken as the AoII cost, as opposed to using the average AoII as the mismatch metric, and this function is also allowed to depend on the estimation value. In this very general setting, our goal is to minimize a weighted sum of AoII and transmission costs. For this purpose, we formulate a discrete-time semi-Markov decision process (SMDP) for the multi-threshold status update policy. We propose a novel discrete-time tool called the 'dual-regime absorbing Markov chain' (DR-AMC), together with its corresponding absorption time distribution, named the 'dual-regime phase-type' (DR-PH) distribution, to obtain the characterizing parameters of the SMDP. This allows us to obtain the distribution of the AoII process for a given policy, and hence the average of any function of AoII. The proposed method is validated with numerical results in which we compare it against policies obtained by exhaustive search and against various benchmark policies.
中文摘要:本文提出了一种基于推送的远程估计系统,通过采用多阈值策略和构建半马尔可夫决策过程,最小化信息错误年龄与传输成本的加权和,并利用新型双机制马尔可夫链工具分析信息错误年龄分布,通过数值结果验证了该方法优于其他策略。
English Summary: This paper studies a push-based remote estimation system and minimizes a weighted sum of age of incorrect information (AoII) and transmission costs by employing a multi-threshold policy and formulating a semi-Markov decision process. Novel dual-regime Markov chain tools are used to analyze the AoII distribution, and numerical results validate the method against other policies.

Authors:Yiting Wang, Wanghao Ye, Ping Guo, Yexiao He, Ziyao Wang, Bowei Tian, Shwai He, Guoheng Sun, Zheyu Shen, Sihan Chen, Ankur Srivastava, Qingfu Zhang, Gang Qu, Ang Li
Title: SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning
Abstract:
Optimizing Register Transfer Level (RTL) code is crucial for improving the power, performance, and area (PPA) of digital circuits in the early stages of synthesis. Manual rewriting, guided by synthesis feedback, can yield high-quality results but is time-consuming and error-prone. Most existing compiler-based approaches have difficulty handling complex design constraints. Large Language Model (LLM)-based methods have emerged as a promising alternative to address these challenges. However, LLM-based approaches often face difficulties in ensuring alignment between the generated code and the provided prompts. This paper presents SymRTLO, a novel neuron-symbolic RTL optimization framework that seamlessly integrates LLM-based code rewriting with symbolic reasoning techniques. Our method incorporates a retrieval-augmented generation (RAG) system of optimization rules and Abstract Syntax Tree (AST)-based templates, enabling LLM-based rewriting that maintains syntactic correctness while minimizing undesired circuit behaviors. A symbolic module is proposed for analyzing and optimizing finite state machine (FSM) logic, allowing fine-grained state merging and partial specification handling beyond the scope of pattern-based compilers. Furthermore, a fast verification pipeline, combining formal equivalence checks with test-driven validation, further reduces the complexity of verification. Experiments on the RTL-Rewriter benchmark with Synopsys Design Compiler and Yosys show that SymRTLO improves power, performance, and area (PPA) by up to 43.9%, 62.5%, and 51.1%, respectively, compared to the state-of-the-art methods.
中文摘要:SymRTLO是一种神经符号框架,通过结合基于大语言模型的RTL代码重写与符号推理及验证技术,在确保代码正确性的同时显著提升了功耗、性能和面积的优化效果。
English Summary: SymRTLO is a neuron-symbolic framework that integrates LLM-based RTL code rewriting with symbolic reasoning and verification techniques to significantly enhance power, performance, and area optimization while ensuring code correctness.

Authors:Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Chun Yuan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
Title: InstructEngine: Instruction-driven Text-to-Image Alignment
Abstract:
Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithms. For training data, most approaches rely on manually annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, and the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce a cross-validation alignment method, which improves data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves the performance of SD v1.5 and SDXL by 10.53% and 5.30%, respectively, outperforming state-of-the-art baselines, with an ablation study confirming the benefits of all of InstructEngine's components. A win rate of over 50% in human reviews also indicates that InstructEngine better aligns with human preferences.
Chinese: InstructEngine框架通过自动化偏好数据生成和交叉验证方法,解决了文本到图像模型对齐中的数据与算法限制,显著提升了模型性能并更好地符合人类偏好。
English: The InstructEngine framework addresses limitations in text-to-image model alignment by automating preference data generation and introducing cross-validation, achieving significant performance improvements and better human preference alignment.

Authors:Tahrim Hossain, Sakib Hassan, Faisal Haque Bappy, Muhammad Nur Yanhaona, Tarannum Shaila Zaman, Tariqul Islam
Title: Bridging Immutability with Flexibility: A Scheme for Secure and Efficient Smart Contract Upgrades
Abstract:
The emergence of blockchain technology has revolutionized contract execution through the introduction of smart contracts. Ethereum, the leading blockchain platform, leverages smart contracts to power decentralized applications (DApps), enabling transparent and self-executing systems across various domains. While the immutability of smart contracts enhances security and trust, it also poses significant challenges for updates, defect resolution, and adaptation to changing requirements. Existing upgrade mechanisms are complex, resource-intensive, and costly in terms of gas consumption, often compromising security and limiting practical adoption. To address these challenges, we propose FlexiContracts+, a novel scheme that reimagines smart contracts by enabling secure, in-place upgrades on Ethereum while preserving historical data without relying on multiple contracts or extensive pre-deployment planning. FlexiContracts+ enhances security, simplifies development, reduces engineering overhead, and supports adaptable, expandable smart contracts. Comprehensive testing demonstrates that FlexiContracts+ achieves a practical balance between immutability and flexibility, advancing the capabilities of smart contract systems.
中文: FlexiContracts+ 提出了一种创新方案,可在以太坊上实现安全、原地的智能合约升级,在保持历史数据完整性的同时平衡不变性与灵活性,并降低开发复杂度。
English: FlexiContracts+ introduces a novel scheme enabling secure, in-place upgrades for Ethereum smart contracts, balancing immutability with flexibility while preserving historical data and reducing development complexity.

Authors:Jia Wei, Xiaoqi Zhao, Jonghye Woo, Jinsong Ouyang, Georges El Fakhri, Qingyu Chen, Xiaofeng Liu
Title: Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation
Abstract:
Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.
中文: 提出的混合形状专家框架通过将专家混合训练与字典学习相结合,有效捕捉多样化且鲁棒的形状先验,并利用Segment Anything模型的强大泛化能力,显著提升了医学图像分割中的单域泛化性能。
English: The proposed Mixture-of-Shape-Experts (MoSE) framework enhances single domain generalization in medical image segmentation by integrating mixture-of-experts training with dictionary learning to capture robust shape priors, which are dynamically fused and provided as prompts to the Segment Anything Model for improved performance.

Authors:Tahrim Hossain, Faisal Haque Bappy, Tarannum Shaila Zaman, Tariqul Islam
Title: CrossLink: A Decentralized Framework for Secure Cross-Chain Smart Contract Execution
Abstract:
This paper introduces CrossLink, a decentralized framework for secure cross-chain smart contract execution that effectively addresses the inherent limitations of contemporary solutions, which primarily focus on asset transfers and rely on potentially vulnerable centralized intermediaries. Recognizing the escalating demand for seamless interoperability among decentralized applications, CrossLink provides a trustless mechanism for smart contracts across disparate blockchain networks to communicate and interact. At its core, CrossLink utilizes a compact chain for selectively storing authorized contract states and employs a secure inter-chain messaging mechanism to ensure atomic execution and data consistency. By implementing a deposit/collateral fee system and efficient state synchronization, CrossLink enhances security and mitigates vulnerabilities, offering a novel approach to seamless, secure, and decentralized cross-chain interoperability. A formal security analysis further validates CrossLink's robustness against unauthorized modifications and denial-of-service attacks.
中文:CrossLink是一种去中心化框架,通过采用精简链存储合约状态和跨链消息机制,实现了安全、无需信任的跨链智能合约执行,确保原子性与数据一致性,并利用抵押系统增强安全性。
English: CrossLink is a decentralized framework enabling secure and trustless cross-chain smart contract execution through a compact chain for state storage and inter-chain messaging, ensuring atomicity and data consistency while mitigating vulnerabilities with a collateral system.

Authors:Tahrim Hossain, Faisal Haque Bappy, Tarannum Shaila Zaman, Raiful Hasan, Tariqul Islam
Title: SmartShift: A Secure and Efficient Approach to Smart Contract Migration
Abstract:
Blockchain and smart contracts have emerged as revolutionary technologies transforming distributed computing. While platform evolution and smart contracts' inherent immutability necessitate migrations both across and within chains, migrating the vast amounts of critical data in these contracts while maintaining data integrity and minimizing operational disruption presents a significant challenge. To address these challenges, we present SmartShift, a framework that enables secure and efficient smart contract migrations through intelligent state partitioning and progressive function activation, preserving operational continuity during transitions. Our comprehensive evaluation demonstrates that SmartShift significantly reduces migration downtime while ensuring robust security, establishing a foundation for efficient and secure smart contract migration systems.
中文: SmartShift框架通过智能状态分区和渐进式功能激活,实现了安全高效的智能合约迁移,在显著减少停机时间的同时确保了系统安全性。
English: SmartShift is a framework that enables secure and efficient smart contract migrations through intelligent state partitioning and progressive function activation, significantly reducing downtime while ensuring robust security.

Authors:Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen
Title: Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Abstract:
Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., adding noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods and to realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
Chinese: 提出的解耦对比解码(DCD)框架通过分离正负样本学习,利用训练后的投影隐式建模真实幻觉模式,有效缓解多模态大语言模型的幻觉问题,同时保持其通用推理能力。
English: The proposed Decoupling Contrastive Decoding (DCD) framework mitigates hallucinations in multimodal large language models by decoupling positive and negative sample learning, using trained projections to implicitly model authentic hallucination patterns without sacrificing general reasoning capabilities.
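
The contrastive decoding step mentioned in the abstract can be sketched in a few lines. The abstract does not give DCD's exact combination rule, so the code below assumes the commonly used form `(1 + alpha) * positive - alpha * negative` applied to next-token logits; the function names, the `alpha` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(pos_logits, neg_logits, alpha=1.0):
    """Amplify the positive branch and subtract the hallucination-prone negative branch.

    Assumed rule (not taken from the abstract): (1 + alpha) * pos - alpha * neg.
    """
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    # Hypothetical next-token logits from the positive and negative projections.
    pos = [2.0, 2.2, 0.1]
    neg = [0.5, 2.5, 0.1]
    print("plain greedy token:", greedy_token(pos))
    print("contrastive greedy token:", greedy_token(contrastive_logits(pos, neg)))
    print("contrastive probabilities:", softmax(contrastive_logits(pos, neg)))
```

In this made-up example, token 1 scores highly under both branches, so the subtraction demotes it and the contrastive choice moves to token 0. Setting `alpha=0` recovers the positive branch unchanged.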

Authors:Krishna C. Puvvada, Faisal Ladhak, Santiago Akle Serrano, Cheng-Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg
Title: SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling
Abstract:
We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
中文: SWAN-GPT是一种仅解码器的Transformer模型,通过结合无位置编码层和滑动窗口注意力的新颖架构,能够稳健地外推到比训练时更长的序列,无需额外长上下文训练即可实现计算效率提升和上下文扩展能力。
English: SWAN-GPT is a decoder-only Transformer model that robustly extrapolates to longer sequences than seen in training through a novel architecture combining positional encoding-free layers and sliding-window attention, achieving computational efficiency and extended context capabilities without additional long-context training.
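
A rough sketch of the two attention patterns that the architecture interleaves may help. The code below only builds boolean attention masks and a layer schedule; it does not implement rotary encodings or the inference-time score scaling, and the interleaving ratio (`swa_per_nope`) is a hypothetical parameter, since the abstract does not state one.

```python
def causal_mask(n):
    """Full causal mask for a global (NoPE) layer: position i may attend to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal sliding-window mask: position i attends to the last `window` positions, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def layer_schedule(n_layers, swa_per_nope=3):
    """Interleave one NoPE layer with `swa_per_nope` SWA-RoPE layers (assumed ratio)."""
    period = swa_per_nope + 1
    return ["NoPE" if i % period == 0 else "SWA-RoPE" for i in range(n_layers)]


if __name__ == "__main__":
    for row in sliding_window_mask(5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(layer_schedule(8))
```

In the sliding-window layers each position sees at most `window` keys no matter how long the sequence is, so the relative offsets handled by the rotary encodings stay within the range seen during training.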

Authors:Zen Kit Heng, Zimeng Zhao, Tianhao Wu, Yuanfei Wang, Mingdong Wu, Yangang Wang, Hao Dong
Title: Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution
Abstract:
Large Language Models (LLMs) are emerging as promising tools for automated reinforcement learning (RL) reward design, owing to their robust capabilities in commonsense reasoning and code generation. By engaging in dialogues with RL agents, LLMs construct a Reward Observation Space (ROS) by selecting relevant environment states and defining their internal operations. However, existing frameworks have not effectively leveraged historical exploration data or manual task descriptions to iteratively evolve this space. In this paper, we propose a novel heuristic framework that enhances LLM-driven reward design by evolving the ROS through a table-based exploration caching mechanism and a text-code reconciliation strategy. Our framework introduces a state execution table, which tracks the historical usage and success rates of environment states, overcoming the Markovian constraint typically found in LLM dialogues and facilitating more effective exploration. Furthermore, we reconcile user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives. Comprehensive evaluations on benchmark RL tasks demonstrate the effectiveness and stability of the proposed framework. Code and video demos are available at jingjjjjjie.github.io/LLM2Reward.
中文摘要:本文提出了一种新颖框架,通过探索缓存和文本-代码协调策略改进基于大语言模型的强化学习奖励设计,在基准任务中展现出更高的有效性和稳定性。
English Summary: This paper introduces a novel framework that improves LLM-driven reinforcement learning reward design by evolving the Reward Observation Space through exploration caching and text-code reconciliation, demonstrating enhanced effectiveness and stability in benchmark tasks.
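
The state execution table described in the abstract can be pictured as a small bookkeeping structure. The sketch below is an assumption about its shape, not the authors' code: the class names, the `update` and `to_prompt` methods, and the text layout are hypothetical. It records how often each environment state was used in a reward function and how often those attempts succeeded, so the history can be fed back to the LLM across dialogue rounds.

```python
from dataclasses import dataclass, field


@dataclass
class StateRecord:
    """Usage statistics for one environment state."""
    uses: int = 0
    successes: int = 0

    @property
    def success_rate(self):
        return self.successes / self.uses if self.uses else 0.0


@dataclass
class StateExecutionTable:
    """Cache of historical exploration results, keyed by state name."""
    records: dict = field(default_factory=dict)

    def update(self, states, succeeded):
        """Log one reward-design attempt that used the given states."""
        for name in states:
            rec = self.records.setdefault(name, StateRecord())
            rec.uses += 1
            rec.successes += int(succeeded)

    def to_prompt(self):
        """Render the history as text that could be appended to the next LLM prompt."""
        lines = ["state | uses | success_rate"]
        for name, rec in sorted(self.records.items()):
            lines.append(f"{name} | {rec.uses} | {rec.success_rate:.2f}")
        return "\n".join(lines)


if __name__ == "__main__":
    table = StateExecutionTable()
    table.update(["gripper_pos", "object_pos"], succeeded=True)
    table.update(["gripper_pos"], succeeded=False)
    print(table.to_prompt())
```

Because the table persists between rounds, the prompt can carry information from attempts that are no longer in the dialogue context.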

Authors:Constantin Ulrich, Tassilo Wald, Fabian Isensee, Klaus H. Maier-Hein
Title: Large Scale Supervised Pretraining For Traumatic Brain Injury Segmentation
Abstract:
The segmentation of lesions in Moderate to Severe Traumatic Brain Injury (msTBI) presents a significant challenge in neuroimaging due to the diverse characteristics of these lesions, which vary in size, shape, and distribution across brain regions and tissue types. This heterogeneity complicates traditional image processing techniques, resulting in critical errors in tasks such as image registration and brain parcellation. To address these challenges, the AIMS-TBI Segmentation Challenge 2024 aims to advance innovative segmentation algorithms specifically designed for T1-weighted MRI data, the most widely utilized imaging modality in clinical practice. Our proposed solution leverages a large-scale multi-dataset supervised pretraining approach inspired by the MultiTalent method. We train a Resenc L network on a comprehensive collection of datasets covering various anatomical and pathological structures, which equips the model with a robust understanding of brain anatomy and pathology. Following this, the model is fine-tuned on msTBI-specific data to optimize its performance for the unique characteristics of T1-weighted MRI scans, and it outperforms the baseline without pretraining by up to 2 Dice points.
中文: AIMS-TBI 2024分割挑战赛针对中重度脑损伤病灶的多样性难题,通过多数据集预训练和针对性微调的模型,在T1加权核磁共振影像上使分割效果较基线提升高达2个Dice值。
English: The AIMS-TBI Segmentation Challenge 2024 addresses the difficulty in segmenting diverse msTBI lesions by developing specialized algorithms, where our model uses multi-dataset pretraining and fine-tuning to outperform the baseline by up to 2 Dice points on T1-weighted MRI scans.

Authors:Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, Chaochao Chen
Title: Bridging the Gap Between Preference Alignment and Machine Unlearning
Abstract:
Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning with Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain, and they are computationally intensive due to training instability, limiting their use in low-resource scenarios. LLM unlearning presents a promising alternative by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation, lacking systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that not all negative examples contribute equally to alignment improvement when unlearned, and the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.
中文摘要:本研究提出“遗忘对齐”(U2A)框架,通过双层优化选择性遗忘负面示例,比传统方法更高效地提升大语言模型的偏好对齐性能。
English Summary: The study introduces Unlearning to Align (U2A), a framework using bi-level optimization to selectively unlearn negative examples, enhancing Preference Alignment in Large Language Models more efficiently than traditional methods.

Authors:Xiaohua Feng, Yuyuan Li, Chengye Wang, Junlin Liu, Li Zhang, Chaochao Chen
Title: A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty
Abstract:
Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm's design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty ($\mathrm{MRD}$) metric to quantify sample-level unlearning difficulty. Using $\mathrm{MRD}$, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an $\mathrm{MRD}$-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
中文摘要:本研究提出记忆移除难度(MRD)指标来量化大语言模型中样本级别的遗忘难度,并通过基于MRD的加权采样方法优化现有遗忘算法,优先处理易遗忘样本以提升遗忘效率,实验验证证实了该方法的有效性。
English Summary: This study introduces a Memory Removal Difficulty (MRD) metric to quantify sample-level unlearning difficulty in LLMs and proposes an MRD-based weighted sampling method that enhances unlearning efficiency by prioritizing easily forgettable samples, with experimental validation confirming its effectiveness.
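
The idea of prioritizing easily forgettable samples can be sketched as a weighted sampler. The abstract defines neither the MRD formula nor the weighting rule, so the code below simply assumes that a per-sample MRD score is already available and applies a softmax over negative scores; the `temperature` parameter and both function names are hypothetical.

```python
import math
import random


def sampling_weights(mrd_scores, temperature=1.0):
    """Map difficulty scores to probabilities that favor low-MRD (easier) samples.

    Assumed rule: softmax over negative scores, shifted by the minimum for stability.
    """
    lowest = min(mrd_scores)
    raw = [math.exp(-(score - lowest) / temperature) for score in mrd_scores]
    total = sum(raw)
    return [value / total for value in raw]


def draw_batch(samples, mrd_scores, batch_size, seed=0):
    """Draw a batch (with replacement) for one unlearning step."""
    rng = random.Random(seed)
    weights = sampling_weights(mrd_scores)
    return rng.choices(samples, weights=weights, k=batch_size)


if __name__ == "__main__":
    forget_set = ["sample_a", "sample_b", "sample_c"]
    scores = [0.1, 0.5, 2.0]  # made-up MRD values
    print(sampling_weights(scores))
    print(draw_batch(forget_set, scores, batch_size=4))
```

A lower temperature concentrates the batches on the easiest samples, while a higher one approaches uniform sampling over the forget set.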

Authors:Longguang Zhong, Fanqi Wan, Ziyi Yang, Guosheng Liang, Tianyuan Shi, Xiaojun Quan
Title: FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion
Abstract:
Heterogeneous model fusion enhances the performance of LLMs by integrating the knowledge and capabilities of multiple structurally diverse models. However, existing approaches often rely solely on selecting the best output for each prompt from source models, which underutilizes their full potential due to limited source knowledge and results in sparse optimization signals. To address this limitation, we propose FuseRL, a novel two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. FuseSFT establishes a robust initialization by integrating the strengths of heterogeneous source models through weighted supervised fine-tuning (SFT) on diverse outputs for each prompt. FusePO optimizes weighted preferences based on the outputs of multiple source models to enable superior alignment performance. Extensive experiments demonstrate the effectiveness of our framework across various preference alignment methods, including RLOO, DPO, and SimPO. Using Llama-3.1-8B-Instruct as the target model, our approach achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Further analysis suggests that FuseSFT regularizes the training process to reduce overfitting, while FusePO introduces dense and diverse signals for preference optimization.
Chinese: 异构模型融合通过整合多样模型提升大语言模型性能,但现有方法仅选择每个提示的最佳输出而未能充分利用其潜力,为此提出FuseRL框架,通过加权微调和偏好优化的两阶段方法强化模型初始化与对齐能力,在多项基准测试中取得最优性能。
English: Heterogeneous model fusion improves LLM performance by integrating diverse models, but existing methods underutilize their potential by merely selecting the best output per prompt, prompting the development of FuseRL, a two-stage framework that enhances model initialization and alignment through weighted fine-tuning and preference optimization, achieving state-of-the-art results on benchmarks.

Authors:Julian Nubert, Turcan Tuna, Jonas Frey, Cesar Cadena, Katherine J. Kuchenbecker, Shehryar Khattak, Marco Hutter
Title: Holistic Fusion: Task- and Setup-Agnostic Robot Localization and State Estimation with Factor Graphs
Abstract:
Seamless operation of mobile robots in challenging environments requires low-latency local motion estimation (e.g., for dynamic maneuvers) and accurate global localization (e.g., for wayfinding). While most existing sensor-fusion approaches are designed for specific scenarios, this work introduces a flexible open-source solution for task- and setup-agnostic multimodal sensor fusion that is distinguished by its generality and usability. Holistic Fusion formulates sensor fusion as a combined estimation problem of i) the local and global robot state and ii) a (theoretically unlimited) number of dynamic context variables, including automatic alignment of reference frames; this formulation fits countless real-world applications without any conceptual modifications. The proposed factor-graph solution enables the direct fusion of an arbitrary number of absolute, local, and landmark measurements expressed with respect to different reference frames by explicitly including them as states in the optimization and modeling their evolution as random walks. Moreover, local smoothness and consistency receive particular attention to prevent jumps in the robot state belief. Holistic Fusion (HF) enables low-latency and smooth online state estimation on typical robot hardware while simultaneously providing low-drift global localization at the IMU measurement rate. The efficacy of this released framework is demonstrated in five real-world scenarios on three robotic platforms, each with distinct task requirements.
中文摘要:本研究提出Holistic Fusion这一灵活的开源多模态传感器融合框架,能够在无需概念修改的情况下实现低延迟局部运动估计和精确全局定位,适用于各种机器人应用场景。
English Summary: This work introduces Holistic Fusion, a flexible open-source framework for multimodal sensor fusion that enables seamless low-latency local motion estimation and accurate global localization across diverse robotic applications without requiring conceptual modifications.

Authors:Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao
Title: Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
Abstract:
Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost-efficiency reasons, this reliance can introduce distribution shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.
中文: 本文提出了一种新颖的分布感知优化框架,通过校准训练样本并聚焦于人类偏好分布的优化,来增强大型语言模型的偏好对齐,从而减轻合成数据带来的分布偏移问题。
English: A novel distribution-aware optimization framework is proposed to enhance preference alignment in large language models by calibrating training samples and focusing optimization on human-preferred distributions, mitigating distribution shifts from synthetic data.

Authors:Jianling Wang, Yifan Liu, Yinghao Sun, Xuejian Ma, Yueqi Wang, He Ma, Zhengyang Su, Minmin Chen, Mingyan Gao, Onkar Dalal, Ed H. Chi, Lichan Hong, Ningren Han, Haokai Lu
Title: User Feedback Alignment for LLM-powered Exploration in Large-scale Recommendation Systems
Abstract:
Exploration, the act of broadening user experiences beyond their established preferences, is challenging in large-scale recommendation systems due to feedback loops and limited signals on user exploration patterns. Large Language Models (LLMs) offer potential solutions by leveraging their world knowledge to recommend novel content outside these loops. A key challenge is aligning LLMs with user preferences while preserving their knowledge and reasoning. To enhance planning for new user interests using LLMs, this paper introduces a novel approach that combines hierarchical planning with LLM inference-time scaling. This method aims to improve recommendation relevancy without compromising novelty. We decouple novelty and user-alignment, training separate LLMs for each objective. We then scale up the novelty-focused LLM's inference and select the best-of-n predictions using the user-aligned LLM. Live experiments demonstrate efficacy, showing significant gains in both user satisfaction (measured by watch activity and active user counts) and exploration diversity.
中文摘要:本文提出了一种结合分层规划和LLM推理时扩展的新方法,通过分离新颖性和用户对齐目标,在实时实验中显著提升了用户满意度和探索多样性。
English Summary: This paper introduces a hierarchical planning method with LLM inference-time scaling to enhance recommendation systems by decoupling novelty and user-alignment, achieving significant gains in user satisfaction and exploration diversity through live experiments.
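
The best-of-n selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' system: `best_of_n` and the toy scorer in the usage note are hypothetical names, and plain strings with a lookup table stand in for the novelty-focused and user-aligned LLMs.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], alignment_score: Callable[[str], float]) -> str:
    """Return the candidate that the user-aligned scorer rates highest.

    `candidates` stands in for n predictions sampled from the novelty-focused
    model; `alignment_score` stands in for the user-aligned model (assumption).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=alignment_score)
```

For example, with a hypothetical score table `{"jazz": 0.2, "chess": 0.7}`, calling `best_of_n(["jazz", "chess"], table.get)` returns `"chess"`. Increasing n widens the pool that the aligned scorer can choose from, which is the inference-time scaling knob the abstract refers to.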

Authors:Benjamin Lipkin, Benjamin LeBrun, Jacob Hoover Vigly, João Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Timothy J. O'Donnell, Alexander K. Lew, Tim Vieira
Title: Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
Abstract:
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
中文: 本文提出一种新算法,通过自适应拒绝采样减少约束评估次数,并提供无偏重要性权重估计,以解决局部约束解码的效率低下和分布扭曲问题,从而提升全局采样准确性。
English: This paper introduces a novel algorithm that overcomes the inefficiency and distributional distortion of locally constrained decoding by using adaptive rejection sampling to reduce constraint evaluations and providing unbiased importance weight estimates for improved global sampling accuracy.
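
The contrast between token masking and rejection sampling can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's algorithm: it covers only the idea of checking the constraint on sampled tokens instead of the full vocabulary, and it omits the importance-weight estimator entirely. The function name and the dictionary-based vocabulary are hypothetical.

```python
import random
from typing import Callable, Dict, Tuple


def sample_with_rejection(
    probs: Dict[str, float],
    is_valid: Callable[[str], bool],
    rng: random.Random,
) -> Tuple[str, int]:
    """Sample one constraint-satisfying token.

    Returns the token and the number of constraint checks performed.
    Rejected tokens are removed before the next draw, so no token is
    checked twice.
    """
    # Drop zero-probability tokens so the weights always sum to a positive value.
    remaining = {tok: p for tok, p in probs.items() if p > 0}
    checks = 0
    while remaining:
        tokens = list(remaining)
        weights = [remaining[tok] for tok in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        checks += 1
        if is_valid(token):
            return token, checks
        del remaining[token]  # never propose a rejected token again
    raise ValueError("no token satisfies the constraint")
```

The accepted token follows the same renormalized local distribution that masking would produce, but the constraint is evaluated only on the tokens actually drawn. When the model already puts most of its mass on valid tokens, the first draw usually succeeds.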

Authors:Jiuyang Bu, Wenkai Li, Zongwei Li, Zeng Zhang, Xiaoqi Li
Title: Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM
Abstract:
Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts, with traditional detection methods struggling to address emerging and machine-unauditable flaws. This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection. We introduce a comprehensive dataset of 215 real-world DApp projects (4,998 contracts), including hard-to-detect logical errors like token price manipulation, addressing the limitations of existing simplified benchmarks. By fine-tuning LLMs (Llama3-8B and Qwen2-7B) with Full-Parameter Fine-Tuning (FFT) and Low-Rank Adaptation (LoRA), our method achieves superior performance, attaining an F1-score of 0.83 with FFT and data augmentation via Random Over Sampling (ROS). Comparative experiments demonstrate significant improvements over prompt-based LLMs and state-of-the-art tools. Notably, the approach excels in detecting non-machine-auditable vulnerabilities, achieving 0.97 precision and 0.68 recall for price manipulation flaws. The results underscore the effectiveness of domain-specific LLM fine-tuning and data augmentation in addressing real-world DApp security challenges, offering a robust solution for blockchain ecosystem protection.
中文: 本文提出一种基于微调大语言模型的新方法,通过全面数据集训练和数据增强技术,显著提升了去中心化应用中智能合约漏洞检测能力,尤其在识别价格操纵等难以发现的逻辑错误方面表现卓越。
English: This paper introduces a fine-tuned Large Language Model approach that significantly improves smart contract vulnerability detection in DApps, achieving superior performance in identifying hard-to-detect logical errors through comprehensive dataset training and data augmentation techniques.
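
Random Over Sampling (ROS), the augmentation named in the abstract, duplicates randomly chosen minority-class samples until every class is as large as the largest one. The sketch below is a generic, standard-library version of that idea; the function name and the toy labels in the usage note are hypothetical, and the paper's actual pipeline may differ.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_over_sample(
    samples: Sequence[str], labels: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Balance classes by re-drawing minority samples with replacement."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if not samples:
        return [], []
    rng = random.Random(seed)
    by_label: Dict[str, List[str]] = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples: List[str] = []
    out_labels: List[str] = []
    for label, group in by_label.items():
        # Keep every original sample, then pad with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

With four contracts labelled `safe` and one labelled `price_manipulation`, the output holds four of each, the minority entries being copies of the single original. Oversampling should be applied to the training split only, otherwise duplicated samples leak into evaluation.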

Authors:Jiuyang Bu, Wenkai Li, Zongwei Li, Zeng Zhang, Xiaoqi Li
Title: SmartBugBert: BERT-Enhanced Vulnerability Detection for Smart Contract Bytecode
Abstract:
Smart contracts deployed on blockchain platforms are exposed to a range of security vulnerabilities. However, only a small number of Ethereum contracts have released their source code, so vulnerability detection at the bytecode level is crucial. This paper introduces SmartBugBert, a novel approach that combines BERT-based deep learning with control flow graph (CFG) analysis to detect vulnerabilities directly from bytecode. Our method first decompiles smart contract bytecode into optimized opcode sequences, extracts semantic features using TF-IDF, constructs control flow graphs to capture execution logic, and isolates vulnerable CFG fragments for targeted analysis. By integrating both semantic and structural information through a fine-tuned BERT model and LightGBM classifier, our approach effectively identifies four critical vulnerability types: transaction-ordering, access control, self-destruct, and timestamp dependency vulnerabilities. Experimental evaluation on 6,157 Ethereum smart contracts demonstrates that SmartBugBert achieves 90.62% precision, 91.76% recall, and 91.19% F1-score, significantly outperforming existing detection methods. Ablation studies confirm that the combination of semantic features with CFG information substantially enhances detection performance. Furthermore, our approach maintains efficient detection speed (0.14 seconds per contract), making it practical for large-scale vulnerability assessment.
中文: 本文提出SmartBugBert方法,结合基于BERT的深度学习与控制流图分析,能有效检测智能合约字节码中的四类关键漏洞,精确率和召回率均超过90%,同时保持高效的大规模检测能力。
English: This paper presents SmartBugBert, a BERT-based deep learning method combined with control flow graph analysis that effectively detects four critical vulnerabilities in smart contract bytecode with over 90% precision and recall, while maintaining high efficiency for large-scale assessments.
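
The three reported metrics are linked by the definition of F1 as the harmonic mean of precision and recall, so the figures can be cross-checked. The helper below is a small sketch for that arithmetic; the function name is an assumption, and only the two input values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Calling `f1_score(0.9062, 0.9176)` gives roughly 0.9119, which agrees with the reported 91.19% F1-score.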

Authors:Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, Siyuan Huang
Title: GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
Abstract:
Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models (LLMs) and Vision Language Models (VLMs) provide complementary guidance -- LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.
中文摘要:GROVE提出了一种通用奖励框架,利用大语言模型和视觉语言模型的互补优势,通过迭代约束优化和直接姿态映射,无需人工设计即可实现开放词汇的物理技能学习,显著提升了动作自然度和训练效率。
English Summary: GROVE introduces a generalized reward framework that leverages LLMs and VLMs to enable open-vocabulary physical skill learning without manual engineering, achieving superior motion naturalness and efficiency through iterative constraint refinement and direct pose projection.

Authors:Kepu Zhang, Zhongxiang Sun, Weijie Yu, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li, Jun Xu
Title: QE-RAG: A Robust Retrieval-Augmented Generation Benchmark for Query Entry Errors
Abstract:
Retrieval-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). While current benchmarks evaluate the performance of RAG methods from various perspectives, they share a common assumption that user queries used for retrieval are error-free. However, in real-world interactions between users and LLMs, query entry errors such as keyboard proximity errors, visual similarity errors, and spelling errors are frequent. The robustness of current RAG methods against such errors remains largely unexplored. To bridge this gap, we propose QE-RAG, the first robust RAG benchmark designed specifically to evaluate performance against query entry errors. We augment six widely used datasets by injecting three common types of query entry errors into randomly selected user queries at rates of 20% and 40%, simulating typical user behavior in real-world scenarios. We analyze the impact of these errors on LLM outputs and find that corrupted queries degrade model performance, which can be mitigated through query correction and training a robust retriever for retrieving relevant documents. Based on these insights, we propose a contrastive learning-based robust retriever training method and a retrieval-augmented query correction method. Extensive in-domain and cross-domain experiments reveal that: (1) state-of-the-art RAG methods, including sequential, branching, and iterative methods, exhibit poor robustness to query entry errors; (2) our method significantly enhances the robustness of RAG when handling query entry errors and is compatible with existing RAG methods, further improving their robustness.
中文: 现有RAG方法对常见的查询输入错误缺乏鲁棒性,为此我们提出了QE-RAG基准,通过向数据集中注入错误并采用查询校正和对比学习训练方法,显著提升了RAG系统的抗干扰能力。
English: Current RAG methods lack robustness against common query entry errors, prompting the development of QE-RAG, a benchmark that injects such errors into datasets and introduces correction and training techniques to significantly enhance RAG's resilience.
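
The error-injection procedure described in the abstract can be sketched for one of the three error types, keyboard proximity errors. Everything below is an illustrative assumption: the partial adjacency map, the function names, and the choice of exactly one substituted character per corrupted query are not taken from the benchmark itself.

```python
import random
from typing import List, Sequence

# Hypothetical, partial QWERTY adjacency map (lowercase keys only).
NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "s": "awedxz", "t": "rfgy", "n": "bhjm", "r": "edft",
}


def keyboard_typo(query: str, rng: random.Random) -> str:
    """Replace one character with a neighbouring key, if any character qualifies."""
    positions = [i for i, ch in enumerate(query) if ch.lower() in NEIGHBORS]
    if not positions:
        return query
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBORS[query[i].lower()])
    return query[:i] + replacement + query[i + 1:]


def corrupt_queries(queries: Sequence[str], rate: float, seed: int = 0) -> List[str]:
    """Inject a typo into a randomly selected fraction `rate` of the queries."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = random.Random(seed)
    k = round(rate * len(queries))
    chosen = set(rng.sample(range(len(queries)), k))
    return [keyboard_typo(q, rng) if i in chosen else q for i, q in enumerate(queries)]
```

Calling `corrupt_queries(queries, 0.2)` and `corrupt_queries(queries, 0.4)` on the same list mirrors the two corruption rates mentioned in the abstract. Fixing the seed keeps the corrupted set reproducible across retrievers being compared.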

Authors:Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, Boris Ginsburg
Title: OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
Abstract:
Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
中文: 大型语言模型正在推动软件开发进步,但受限于缺乏高质量的编程监督微调数据集,而OpenCodeInstruct通过500万多样化样本填补了这一空白,显著提升了各类基准测试中的模型性能。
English: Large Language Models are advancing software development but face limitations due to a lack of high-quality supervised fine-tuning datasets for coding, which OpenCodeInstruct addresses with 5 million diverse samples to significantly enhance model performance across various benchmarks.

Authors:Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, Pattie Maes
Title: Investigating Affective Use and Emotional Well-being on ChatGPT
Abstract:
As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users' emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 3 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find the impact of voice-based interactions on emotional well-being to be highly nuanced and influenced by factors such as the user's initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues.
中文摘要:本研究通过大规模数据分析及受控实验,发现过度使用ChatGPT(特别是语音模式)可能引发用户依赖,且其对情绪健康的影响因用户初始状态和使用时长呈现复杂差异。
English Summary: This study examines how interactions with ChatGPT, especially its voice mode, affect users' emotional well-being through large-scale data analysis and controlled trials, revealing nuanced impacts including potential dependence from high usage.

Authors:Litao Hua, Fan Liu, Jie Su, Xingyu Miao, Zizhou Ouyang, Zeyu Wang, Runze Hu, Zhenyu Wen, Bing Zhai, Yang Long, Haoran Duan, Yuan Zhou
Title: Attention in Diffusion Model: A Survey
Abstract:
Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy that categorises attention-related modifications into parts according to the structural components they affect, offering a clear lens through which to understand their functional diversity. In addition to reviewing architectural innovations, we examine how attention mechanisms contribute to performance improvements in diverse applications. We also identify current limitations and underexplored areas, and outline potential directions for future research. Our study provides valuable insights into the evolving landscape of diffusion models, with a particular focus on the integrative and ubiquitous role of attention.
中文: 本综述系统分析了注意力机制在扩散模型中的作用、设计模式与操作,提出了统一分类法并考察其在各应用中的性能贡献,同时指出了未来研究方向。
English: This comprehensive survey systematically analyzes the roles, design patterns, and operations of attention mechanisms in diffusion models, proposing a unified taxonomy and examining their performance contributions across applications while identifying future research directions.

Authors:Kai Lascheit, Daniel Barath, Marc Pollefeys, Leonidas Guibas, Francis Engelmann
Title: Robust Human Registration with Body Part Segmentation on Noisy Point Clouds
Abstract:
Registering human meshes to 3D point clouds is essential for applications such as augmented reality and human-robot interaction but often yields imprecise results due to noise and background clutter in real-world data. We introduce a hybrid approach that incorporates body-part segmentation into the mesh fitting process, enhancing both human pose estimation and segmentation accuracy. Our method first assigns body part labels to individual points, which then guide a two-step SMPL-X fitting: initial pose and orientation estimation using body part centroids, followed by global refinement of the point cloud alignment. Additionally, we demonstrate that the fitted human mesh can refine body part labels, leading to improved segmentation. Evaluations on the cluttered and noisy real-world datasets InterCap, EgoBody, and BEHAVE show that our approach significantly outperforms prior methods in both pose estimation and segmentation accuracy. Code and results are available on our project website: https://segfit.github.io
中文: 本文提出了一种将身体部位分割融入人体网格配准的混合方法,在嘈杂的真实世界数据集上显著提升了姿态估计和分割精度。
English: This paper presents a hybrid method that integrates body-part segmentation into human mesh registration, significantly improving pose estimation and segmentation accuracy on noisy real-world datasets.

Authors:Sophie Hall, Florian Dörfler, Heinrich H. Nax, Saverio Bolognani
Title: The Limits of "Fairness" of the Variational Generalized Nash Equilibrium
Abstract:
Generalized Nash equilibrium (GNE) problems are commonly used to model strategic interactions between self-interested agents who are coupled in cost and constraints. Specifically, the variational GNE (v-GNE), a refinement of the GNE, is often selected as the solution concept due to its non-discriminatory treatment of agents by charging a uniform "shadow price" for shared resources. We study the fairness concept of v-GNEs from a comparability perspective and show that it makes an implicit assumption of unit comparability of agents' cost functions, one of the strongest comparability notions. Further, we introduce a new solution concept, the f-GNE, in which a fairness metric compatible with the comparability at hand is chosen a priori. We introduce an electric vehicle charging game to demonstrate the fragility of v-GNE fairness and compare it to the f-GNE under various fairness metrics.
Chinese: 本研究分析了变分广义纳什均衡(v-GNE)的公平性,揭示了其对智能体成本函数单位可比性的隐含假设,并提出新的解决方案f-GNE,通过预先设定公平度量来弥补这一缺陷,并以电动汽车充电博弈为例进行了验证。
English: The study examines the fairness of variational generalized Nash equilibrium (v-GNE), revealing its implicit assumption of unit comparability in agents' cost functions, and introduces a new solution concept, f-GNE, which incorporates a pre-selected fairness metric to address this limitation, as demonstrated through an electric vehicle charging game.

Authors:Irena Radišić, Francesco Regazzoni, Michele Bucelli, Stefano Pagani, Luca Dede', Alfio Quarteroni
Title: Influence of cellular mechano-calcium feedback in numerical models of cardiac electromechanics
Abstract:
Multiphysics and multiscale mathematical models enable the non-invasive study of cardiac function. These models often rely on simplifying assumptions that neglect certain biophysical processes to balance fidelity and computational cost. In this work, we propose an eikonal-based framework that incorporates mechano-calcium feedback -- the effect of mechanical deformation on calcium-troponin buffering -- while introducing only negligible computational overhead. To assess the impact of mechano-calcium feedback at the organ level, we develop a bidirectionally coupled cellular electromechanical model and integrate it into two cardiac multiscale frameworks: a monodomain-driven model that accounts for geometric feedback on electrophysiology and the proposed eikonal-based approach, which instead neglects geometric feedback. By ensuring consistent cellular model calibration across all scenarios, we isolate the role of mechano-calcium feedback and systematically compare its effects against models without it. Our results indicate that, under baseline conditions, mechano-calcium feedback has minimal influence on overall cardiac function. However, its effects become more pronounced in altered force generation scenarios, such as inotropic modulation. Furthermore, we demonstrate that the eikonal-based framework, despite omitting other types of mechano-electric feedback, effectively captures the role of mechano-calcium feedback at significantly lower computational costs than the monodomain-driven model, reinforcing its utility in computational cardiology.
中文: 本研究提出了一种基于程函方程的框架,以可忽略的计算成本整合了机械-钙反馈,结果显示其在正常条件下影响甚微,但在力生成改变时作用显著,且比传统模型更高效。
English: The study introduces an eikonal-based framework that incorporates mechano-calcium feedback with minimal computational overhead, showing it has little effect under normal conditions but becomes significant during altered force generation, while proving more efficient than traditional models.

Authors:Ioannis Anagnostides, Gabriele Farina, Tuomas Sandholm, Brian Hu Zhang
Title: A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition
Abstract:
Solving (Stampacchia) variational inequalities (SVIs) is a foundational problem at the heart of optimization, with a host of critical applications ranging from engineering to economics. However, this expressivity comes at the cost of computational hardness. As a result, most research has focused on carving out specific subclasses that elude those intractability barriers. A classical property that goes back to the 1960s is the Minty condition, which postulates that the Minty VI (MVI) problem -- the weak dual of the SVI problem -- admits a solution. In this paper, we establish the first polynomial-time algorithm -- that is, with complexity growing polynomially in the dimension $d$ and $\log(1/ε)$ -- for solving $ε$-SVIs for Lipschitz continuous mappings under the Minty condition. Prior approaches either incurred an exponentially worse dependence on $1/ε$ (and other natural parameters of the problem) or made overly restrictive assumptions -- such as strong monotonicity. To do so, we introduce a new variant of the ellipsoid algorithm wherein separating hyperplanes are obtained after taking a gradient descent step from the center of the ellipsoid. It succeeds even though the set of SVIs can be nonconvex and not fully dimensional. Moreover, when our algorithm is applied to an instance with no MVI solution and fails to identify an SVI solution, it produces a succinct certificate of MVI infeasibility. We also show that deciding whether the Minty condition holds is $\mathsf{coNP}$-complete. We provide several extensions and new applications of our main results. Specifically, we obtain the first polynomial-time algorithms for i) solving monotone VIs, ii) globally minimizing a (potentially nonsmooth) quasar-convex function, and iii) computing Nash equilibria in multi-player harmonic games.
中文: 本文提出了首个在Minty条件下求解Stampacchia变分不等式的多项式时间算法,通过新型椭球方法处理非凸集,并在无解时提供不可行性证明。
English: This paper presents the first polynomial-time algorithm for solving Stampacchia variational inequalities under the Minty condition, using a novel ellipsoid method that handles nonconvex sets and provides infeasibility certificates when no solution exists.
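
For orientation, the two problems named in the abstract can be written out in their standard textbook form. The notation below is an assumption: a mapping $F$ on a convex constraint set $\mathcal{X} \subseteq \mathbb{R}^d$, which may differ from the paper's exact normalization.

```latex
% Stampacchia VI (SVI): find x^* \in \mathcal{X} such that
\langle F(x^*),\, x - x^* \rangle \;\ge\; 0 \qquad \forall x \in \mathcal{X}.

% Minty VI (MVI): find x^* \in \mathcal{X} such that
\langle F(x),\, x - x^* \rangle \;\ge\; 0 \qquad \forall x \in \mathcal{X}.

% \epsilon-approximate SVI: find x^* \in \mathcal{X} such that
\langle F(x^*),\, x - x^* \rangle \;\ge\; -\epsilon \qquad \forall x \in \mathcal{X}.
```

The only difference between the first two lines is where $F$ is evaluated: at the candidate solution $x^*$ for the SVI, and at the comparison point $x$ for the MVI. The Minty condition is the assumption that the second problem has a solution.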

Authors:Boce Hu, Heng Tian, Dian Wang, Haojie Huang, Xupeng Zhu, Robin Walters, Robert Platt
Title: Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization
Abstract:
Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution to mitigate this issue is combining pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the rich geometric structures inherent in such tasks, thus limiting their effectiveness in complex, heavily cluttered scenarios. To address this, we propose the Equivariant Push-Grasp Network, a novel framework for joint pushing and grasping policy learning. Our contributions are twofold: (1) leveraging SE(2)-equivariance to improve both pushing and grasping performance and (2) a grasp score optimization-based training strategy that simplifies the joint learning process. Experimental results show that our method improves grasp success rates by 49% in simulation and by 35% in real-world scenarios compared to strong baselines, representing a significant advancement in push-grasp policy learning.
中文: 等变推抓网络利用SE(2)等变性和抓取分数优化策略,显著提升了杂乱环境中的机器人抓取成功率,在仿真和真实场景中分别比基线方法提高了49%和35%。
English: The Equivariant Push-Grasp Network leverages SE(2)-equivariance and a grasp score optimization strategy to significantly improve robotic grasping in cluttered environments, achieving 49% higher success rates in simulation and 35% in real-world tests compared to baselines.

Authors:Mei Li, Yuxiang Lu, Qinyan Dai, Suizhi Huang, Yue Ding, Hongtao Lu
Title: BECAME: BayEsian Continual Learning with Adaptive Model MErging
Abstract:
Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
Chinese: 本文提出BECAME框架,通过结合梯度投影与自适应模型融合来优化持续学习中的稳定性与可塑性平衡,凭借理论分析和大量实验验证,其性能超越了现有最先进方法。
English: This paper introduces BECAME, a two-stage framework that combines gradient projection with adaptive model merging to enhance the stability-plasticity trade-off in continual learning, achieving superior performance over existing methods through theoretical insights and extensive experiments.
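
The role of a merging coefficient can be illustrated with a generic sketch. It assumes, for illustration only, that merging is a convex combination of the previous-task and new-task parameters; the abstract does not state the merging rule, and its closed-form coefficient is not reproduced here, so `coeff` is simply an input and all names are hypothetical.

```python
from typing import List, Sequence


def merge_parameters(
    old: Sequence[float], new: Sequence[float], coeff: float
) -> List[float]:
    """Interpolate between two parameter vectors.

    coeff = 0 keeps the old model (maximum stability);
    coeff = 1 keeps the new model (maximum plasticity).
    """
    if len(old) != len(new):
        raise ValueError("parameter vectors must have the same length")
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")
    return [(1.0 - coeff) * o + coeff * n for o, n in zip(old, new)]
```

With `old = [0.0, 2.0]`, `new = [1.0, 4.0]`, and `coeff = 0.5`, the merged vector is `[0.5, 3.0]`. An adaptive scheme would choose `coeff` per task instead of fixing it by hand.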

Authors:Zidong Yu, Shuo Wang, Nan Jiang, Weiqiang Huang, Xu Han, Junliang Du
Title: Improving Harmful Text Detection with Joint Retrieval and External Knowledge
Abstract:
Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.
中文: 本研究提出了一种结合预训练语言模型与知识图谱的联合检索框架,显著提升了有害文本检测的准确性和鲁棒性,尤其在资源匮乏和多语言环境中表现突出。
English: This study introduces a joint retrieval framework combining pre-trained language models and knowledge graphs to significantly enhance harmful text detection accuracy and robustness, particularly in low-resource and multilingual settings.

Authors:Mingshuai Yao, Mengting Chen, Qinye Zhou, Yabo Zhang, Ming Liu, Xiaoming Li, Shaohui Liu, Chen Ju, Shuai Xiao, Qingwen Liu, Jinsong Lan, Wangmeng Zuo
Title: Beyond Static Scenes: Camera-controllable Background Generation for Human Motion
Abstract:
In this paper, we investigate the generation of new video backgrounds given a human foreground video, a camera pose, and a reference scene image. This task presents three key challenges. First, the generated background should precisely follow the camera movements corresponding to the human foreground. Second, as the camera shifts in different directions, newly revealed content should appear seamless and natural. Third, objects within the video frame should maintain consistent textures as the camera moves to ensure visual coherence. To address these challenges, we propose DynaScene, a new framework that uses camera poses extracted from the original video as an explicit control to drive background motion. Specifically, we design a multi-task learning paradigm that incorporates auxiliary tasks, namely background outpainting and scene variation, to enhance the realism of the generated backgrounds. Given the scarcity of suitable data, we constructed a large-scale, high-quality dataset tailored for this task, comprising video foregrounds, reference scene images, and corresponding camera poses. This dataset contains 200K video clips, ten times larger than existing real-world human video datasets, providing a significantly richer and more diverse training resource. Project page: https://yaomingshuai.github.io/Beyond-Static-Scenes.github.io/
中文摘要:本文提出DynaScene框架,通过相机姿态控制生成与人物前景运动同步的动态视频背景,并构建了包含20万视频片段的大规模数据集以支持该任务。
English Summary: This paper introduces DynaScene, a framework that generates dynamic video backgrounds synchronized with human foreground movements using camera pose controls, supported by a newly created 200K video clip dataset.

Authors:Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg
Title: OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Abstract:
Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.
中文: 该研究构建了优质的监督微调数据集,通过指令多样性将推理能力蒸馏至学生模型,实现了顶尖的编程性能并开源了相关资源。
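English: The study builds a supervised fine-tuning dataset for distilling reasoning capabilities into student models, achieving state-of-the-art coding results with SFT alone (61.8% on LiveCodeBench and 24.6% on CodeContests), finding that instruction diversity matters more than solution correctness, and planning to open-source the datasets and distilled models.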

Authors:Anuj Apte, Shree Hari Sureshbabu, Ruslan Shaydulin, Sami Boulebnane, Zichang He, Dylan Herman, James Sud, Marco Pistoia
Title: Iterative Interpolation Schedules for Quantum Approximate Optimization Algorithm
Abstract:
Quantum Approximate Optimization Algorithm (QAOA) is a promising quantum optimization heuristic with empirical evidence of speedup over classical state-of-the-art for some problems. QAOA solves optimization problems using a parameterized circuit with $p$ layers, with higher $p$ leading to better solutions. Existing methods require optimizing $2p$ independent parameters which is challenging for large $p$. In this work, we present an iterative interpolation method that exploits the smoothness of optimal parameter schedules by expressing them in a basis of orthogonal functions, generalizing Zhou et al. By optimizing a small number of basis coefficients and iteratively increasing both circuit depth and the number of coefficients until convergence, our approach enables construction of high-quality schedules for large $p$. We demonstrate our method achieves better performance with fewer optimization steps than current approaches on three problems: the Sherrington-Kirkpatrick (SK) model, portfolio optimization, and Low Autocorrelation Binary Sequences (LABS). For the largest LABS instance, we achieve near-optimal merit factors with schedules exceeding 1000 layers, an order of magnitude beyond previous methods. As an application of our technique, we observe a mild growth of QAOA depth sufficient to solve SK model exactly, a result of independent interest.
中文: 本文提出了一种基于正交函数的迭代插值方法,通过优化少量基函数系数并逐步增加电路深度,实现了对大规模问题的高质量求解,其优化效率远超现有方法。
English: This paper introduces an iterative interpolation method that uses orthogonal functions to optimize QAOA parameters efficiently, enabling high-performance solutions for large-scale problems with significantly fewer optimization steps than existing approaches.

Authors:Shu Han, Xubo Zhu, Ji Wu, Ximeng Cai, Wen Yang, Huai Yu, Gui-Song Xia
Title: UniCalib: Targetless LiDAR-Camera Calibration via Probabilistic Flow on Unified Depth Representations
Abstract:
Precise LiDAR-camera calibration is crucial for integrating these two sensors into robotic systems to achieve robust perception. In applications like autonomous driving, online targetless calibration enables prompt correction of sensor misalignment caused by mechanical vibrations, without extra targets. However, existing methods exhibit limitations in effectively extracting consistent features from LiDAR and camera data and fail to prioritize salient regions, compromising cross-modal alignment robustness. To address these issues, we propose DF-Calib, a LiDAR-camera calibration method that reformulates calibration as an intra-modality depth flow estimation problem. DF-Calib estimates a dense depth map from the camera image and completes the sparse LiDAR projected depth map, using a shared feature encoder to extract consistent depth-to-depth features, effectively bridging the 2D-3D cross-modal gap. Additionally, we introduce a reliability map to prioritize valid pixels and propose a perceptually weighted sparse flow loss to enhance depth flow estimation. Experimental results across multiple datasets validate its accuracy and generalization, with DF-Calib achieving a mean translation error of 0.635cm and rotation error of 0.045 degrees on the KITTI dataset.
中文: DF-Calib是一种创新的激光雷达-相机标定方法,通过共享特征编码和可靠性映射将跨模态对齐转化为深度流估计问题,在KITTI数据集上实现了0.635厘米平移误差和0.045度旋转误差的顶尖精度。
English: DF-Calib is a novel LiDAR-camera calibration method that transforms cross-modal alignment into depth flow estimation using shared feature encoding and reliability mapping, achieving state-of-the-art accuracy with 0.635cm translation and 0.045° rotation errors on KITTI.

Authors:Mohan Zhang, Pingzhi Li, Jie Peng, Mufan Qiu, Tianlong Chen
Title: Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design
Abstract:
Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. However, in practice, the efficiency of MoE is hard to achieve for two key reasons: (1) imbalanced expert activation, which leads to substantial idle time during model or expert parallelism and to insufficient capacity utilization; and (2) massive communication overhead, induced by numerous expert routing combinations in expert parallelism at the system level. Previous works typically formulate this as a load imbalance issue, characterized by the gating network favoring certain experts over others, or attribute it to static execution, which fails to adapt to the dynamic expert workload at runtime. In this paper, we examine it from a new perspective, a higher-order view and analysis of MoE routing policies: expert collaboration and specialization, where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, leading to increased communication overhead from repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups, as well as to improve expert utilization, and present an efficient implementation of MoE that further leverages expert specialization. We achieve an average performance improvement of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP benchmarks, and reduce the all2all communication costs between GPUs, bringing an extra 20%-30% total running time savings on top of the existing SoTA, i.e. MegaBlocks.
中文:混合专家模型因专家激活不均衡和通信开销大而效率受限,本文提出协作约束路由策略,通过促进专家专业化来提高性能,在现有技术基础上额外节省20%-30%的运行时间。
English: Mixture-of-Experts (MoE) models face efficiency challenges from imbalanced expert activation and high communication overhead. This paper addresses them with a collaboration-constrained routing (C2R) strategy that promotes expert specialization, yielding average gains of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE and a further 20%-30% saving in total running time on top of MegaBlocks.

Authors:Sakhinana Sagar Srinivas, Akash Das, Shivam Gupta, Venkataramana Runkana
Title: Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
Abstract:
We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
中文摘要:该框架通过动态检索优化和强化微调增强检索增强生成系统,在知识密集型任务中显著提高事实准确性及效率,同时有效减少幻觉现象。
English Summary: This framework enhances Retrieval-Augmented Generation systems through dynamic retrieval optimization and reinforcement fine-tuning, significantly improving factual accuracy and efficiency in knowledge-intensive tasks while reducing hallucinations.

Authors:Elif Beray Sariisik, Melih Bastopcu, Nail Akar, Sennur Ulukus
Title: How to Maximize Efficiency in Systems with Exhausted Workers
Abstract:
We consider the problem of assigning tasks efficiently to a set of workers that can exhaust themselves as a result of processing tasks. If a worker is exhausted, it will take a longer time to recover. To model efficiency of workers with exhaustion, we use a continuous-time Markov chain (CTMC). By taking samples from the internal states of the workers, the source assigns tasks to the workers when they are found to be in their efficient states. We consider two different settings where (i) the source can assign tasks to the workers only when they are in their most efficient state, and (ii) it can assign tasks to workers when they are also moderately efficient in spite of a potentially reduced success probability. In the former case, we find the optimal policy to be a threshold-based sampling policy where the thresholds depend on the workers' recovery and exhaustion rates. In the latter case, we solve a non-convex sum-of-ratios problem using a branch-and-bound approach which performs well compared with the globally optimal solution.
中文摘要:该研究利用连续时间马尔可夫链模拟工人疲劳状态,提出任务分配策略,包括基于阈值的采样策略和通过分支定界法处理中等效率状态下的任务分配优化问题。
English Summary: This study models worker exhaustion using a continuous-time Markov chain and proposes task assignment strategies, including a threshold-based policy for optimal efficiency and a branch-and-bound method for handling reduced success probabilities in moderately efficient states.

Authors:Lei Wang, Yujie Zhong, Xiaopeng Sun, Jingchun Cheng, Chengjian Feng, Qiong Cao, Lin Ma, Zhaoxin Fan
Title: AP-CAP: Advancing High-Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline
Abstract:
The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.
中文摘要:本研究提出AP-CAP可控图像生成流程,通过多模态生成模型和三项创新策略合成动物姿态估计数据,构建首个混合真实与合成数据的MPCH数据集,有效解决了数据稀缺问题并显著提升了姿态估计器的性能。
English Summary: This study introduces AP-CAP, a controllable image generation pipeline that synthesizes animal pose estimation data through multi-modal generation and three innovative strategies, creating the hybrid MPCH dataset to overcome dataset scarcity and enhance model performance.

Authors:Sakhinana Sagar Srinivas, Akash Das, Shivam Gupta, Venkataramana Runkana
Title: Agentic Multimodal AI for Hyperpersonalized B2B and B2C Advertising in Competitive Markets: An AI-Driven Competitive Advertising Framework
Abstract:
The growing use of foundation models (FMs) in real-world applications demands adaptive, reliable, and efficient strategies for dynamic markets. In the chemical industry, AI-discovered materials drive innovation, but commercial success hinges on market adoption, requiring FM-driven advertising frameworks that operate in-the-wild. We present a multilingual, multimodal AI framework for autonomous, hyper-personalized advertising in B2B and B2C markets. By integrating retrieval-augmented generation (RAG), multimodal reasoning, and adaptive persona-based targeting, our system generates culturally relevant, market-aware ads tailored to shifting consumer behaviors and competition. Validation combines real-world product experiments with a Simulated Humanistic Colony of Agents to model consumer personas, optimize strategies at scale, and ensure privacy compliance. Synthetic experiments mirror real-world scenarios, enabling cost-effective testing of ad strategies without risky A/B tests. Combining structured retrieval-augmented reasoning with in-context learning (ICL), the framework boosts engagement, prevents market cannibalization, and maximizes ROAS. This work bridges AI-driven innovation and market adoption, advancing multimodal FM deployment for high-stakes decision-making in commercial marketing.
中文:本研究提出了一种多语言、多模态AI框架,通过检索增强生成和自适应角色定位技术,为动态市场生成超个性化、文化契合的广告,并借助真实场景实验和模拟消费者模型验证,有效提升用户参与度和广告投资回报率。
English: This study introduces a multilingual, multimodal AI framework that leverages retrieval-augmented generation and adaptive persona-based targeting to create hyper-personalized, culturally relevant advertisements for dynamic markets, validated through real-world experiments and simulated consumer modeling to enhance engagement and return on ad spend.

Authors:Leonhard Sommer, Olaf Dünkel, Christian Theobalt, Adam Kortylewski
Title: Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Abstract:
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
中文:Common3D提出了一种从视频中自监督学习三维可形变模型的新方法,通过神经特征和对比学习提升了对常见物体的三维姿态和语义对应关系的估计能力。
English: Common3D introduces a self-supervised method to learn 3D morphable models from videos, using neural features and contrastive learning for improved 3D pose and correspondence estimation across common objects.

Authors:Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Vinod P
Title: XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs
Abstract:
Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.
中文摘要:大型语言模型在关键领域应用时面临安全威胁,为此提出的XBreaking攻击利用可解释性分析揭示对齐模式,通过针对性噪声注入有效突破模型的安全限制。
English Summary: Large Language Models face security threats that hinder their adoption in critical sectors, leading to the development of XBreaking, an explainable jailbreak attack that exploits alignment patterns through targeted noise injection to breach model constraints.

Authors:Haoyan Xu, Zhengtao Yao, Ziyi Wang, Zhan Cheng, Xiyang Hu, Mengyuan Li, Yue Zhao
Title: Graph Synthetic Out-of-Distribution Exposure with Large Language Models
Abstract:
Out-of-distribution (OOD) detection in graphs is critical for ensuring model robustness in open-world and safety-sensitive applications. Existing graph OOD detection approaches typically train an in-distribution (ID) classifier on ID data alone, then apply post-hoc scoring to detect OOD instances. While OOD exposure - adding auxiliary OOD samples during training - can improve detection, current graph-based methods often assume access to real OOD nodes, which is often impractical or costly. In this paper, we present GOE-LLM, a framework that leverages Large Language Models (LLMs) to achieve OOD exposure on text-attributed graphs without using any real OOD nodes. GOE-LLM introduces two pipelines: (1) identifying pseudo-OOD nodes from the initially unlabeled graph using zero-shot LLM annotations, and (2) generating semantically informative synthetic OOD nodes via LLM-prompted text generation. These pseudo-OOD nodes are then used to regularize ID classifier training and enhance OOD detection awareness. Empirical results on multiple benchmarks show that GOE-LLM substantially outperforms state-of-the-art methods without OOD exposure, achieving up to a 23.5% improvement in AUROC for OOD detection, and attains performance on par with those relying on real OOD labels for exposure.
中文摘要:GOE-LLM是一种创新框架,利用大型语言模型生成合成分布外节点来增强文本属性图的异常检测,在不依赖真实分布外数据的情况下实现了显著性能提升。
English Summary: GOE-LLM is a novel framework that leverages Large Language Models to generate synthetic out-of-distribution nodes for enhanced OOD detection on text-attributed graphs, achieving significant performance improvements without requiring real OOD data.

Authors:Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
Title: TesserAct: Learning 4D Embodied World Models
Abstract:
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.
中文: 本文提出了一种利用RGB-DN视频数据学习4D具身世界模型的有效方法,能够实现时空一致的场景预测,并在策略学习方面显著优于传统模型。
English: This paper introduces an effective method for learning 4D embodied world models using RGB-DN video data, enabling temporally and spatially coherent scene predictions and superior policy learning compared to traditional models.

Authors:Beizhe Hu, Qiang Sheng, Juan Cao, Yang Li, Danding Wang
Title: LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation
Abstract:
Online fake news moderation now faces a new challenge brought by the malicious use of large language models (LLMs) in fake news production. Though existing works have shown LLM-generated fake news is hard to detect from an individual aspect, it remains underexplored how its large-scale release will impact the news ecosystem. In this study, we develop a simulation pipeline and a dataset with ~56k generated news of diverse types to investigate the effects of LLM-generated fake news within neural news recommendation systems. Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news as LLM-generated news is involved in news recommendation. We further provide an explanation about why truth decay occurs from a familiarity perspective and show the positive correlation between perplexity and news ranking. Finally, we discuss the threats of LLM-generated fake news and provide possible countermeasures. We urge stakeholders to address this emerging challenge to preserve the integrity of news ecosystems.
中文摘要:研究发现,大型语言模型生成的虚假新闻在推荐系统中引发“真相衰退”现象,导致真实新闻排名优势丧失,亟需采取应对措施以维护新闻生态系统的完整性。
English Summary: The study reveals that large language model-generated fake news causes "truth decay" in recommendation systems, where real news loses ranking advantage, and calls for urgent countermeasures to protect news ecosystem integrity.

Authors:Yiyang Peng, Hongyu Li, Zheyu Wu, Bruno Clerckx
Title: Lossy Beyond Diagonal Reconfigurable Intelligent Surfaces: Modeling and Optimization
Abstract:
Beyond diagonal reconfigurable intelligent surface (BD-RIS) has emerged as an advancement and generalization of the conventional diagonal RIS (D-RIS) by introducing tunable interconnections between RIS elements, enabling smarter wave manipulation and enlarged coverage. While BD-RIS has demonstrated advantages over D-RIS in various aspects, most existing works rely on the assumption of a lossless model, leaving practical considerations unaddressed. This paper thus proposes a lossy BD-RIS model and develops corresponding optimization algorithms for various BD-RIS-aided communication systems. First, by leveraging admittance parameter analysis, we model each tunable admittance based on a lumped circuit with losses and derive an expression of a circle characterizing the real and imaginary parts of each tunable admittance. We then consider the received signal power maximization in single-user single-input single-output (SISO) systems with the proposed lossy BD-RIS model. To solve the optimization problem, we design an effective algorithm by carefully exploiting the problem structure. Specifically, an alternating direction method of multipliers (ADMM) framework is custom-designed to deal with the complicated constraints associated with lossy BD-RIS. Furthermore, we extend the proposed algorithmic framework to more general multiuser multiple-input single-output (MU-MISO) systems, where the transmit precoder and BD-RIS scattering matrix are jointly designed to maximize the sum-rate of the system. Finally, simulation results demonstrate that all BD-RIS architectures still outperform D-RIS in the presence of losses, but the optimal BD-RIS architectures in the lossless case are not necessarily optimal in the lossy case, e.g. group-connected BD-RIS can outperform fully- and tree-connected BD-RISs in SISO systems with relatively high losses, whereas the opposite always holds true in the lossless case.
中文: 本文提出了有损超对角可重构智能表面(BD-RIS)模型,并为单用户和多用户系统开发了优化算法,结果表明尽管BD-RIS始终优于传统对角RIS,但最优架构会随损耗条件变化。
English: This paper introduces a lossy model for beyond diagonal reconfigurable intelligent surfaces (BD-RIS) and develops optimization algorithms for single-user and multiuser systems, demonstrating that while BD-RIS consistently outperforms conventional diagonal RIS, the optimal architecture varies with loss conditions.

Authors:Chen Su, Yuanhe Tian, Yan Song
Title: Multimodal Conditioned Diffusive Time Series Forecasting
Abstract:
Diffusion models achieve remarkable success in processing images and text, and have been extended to special domains such as time series forecasting (TSF). Existing diffusion-based approaches for TSF primarily focus on modeling single-modality numerical sequences, overlooking the rich multimodal information in time series data. To effectively leverage such information for prediction, we propose a multimodal conditioned diffusion model for TSF, namely, MCD-TSF, to jointly utilize timestamps and texts as extra guidance for time series modeling, especially for forecasting. Specifically, timestamps are combined with time series to establish temporal and semantic correlations among different data points when aggregating information along the temporal dimension. Texts serve as supplementary descriptions of the time series' history, and are adaptively aligned with data points as well as dynamically controlled in a classifier-free manner. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed MCD-TSF model achieves state-of-the-art performance.
中文: 提出的MCD-TSF模型通过结合时间戳和文本描述作为多模态指导来改进时间序列预测,在多个现实世界数据集中实现了最优性能。
English: The proposed MCD-TSF model enhances time series forecasting by integrating timestamps and textual descriptions as multimodal guidance, achieving state-of-the-art results across diverse real-world datasets.

Authors:Rong Gao, Xin Liu, Zhuozhao Hu, Bohao Xing, Baiqiang Xia, Zitong Yu, Heikki Kälviäinen
Title: FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
Abstract:
Figure skating, known as the "Art on Ice," is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models' understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.
中文: 摘要介绍了FSAnno数据集,该数据集通过整合花样滑冰的技术与艺术标注填补了现有研究空白,并推出FSBench评估基准,揭示了当前人工智能在理解艺术类运动方面存在的明显不足。
English: The paper introduces FSAnno, a large-scale figure skating dataset that addresses the lack of integrated technical and artistic annotations, together with FSBench, a benchmark for fair model evaluation whose initial tests reveal significant limitations in existing models' understanding of artistic sports.

Authors:Shoujie Li, Jianle Xu, Tong Wu, Yang Yang, Yanbo Chen, Xueqian Wang, Wenbo Ding, Xiao-Ping Zhang
Title: VTire: A Bimodal Visuotactile Tire with High-Resolution Sensing Capability
Abstract:
Developing smart tires with high sensing capability is significant for improving the moving stability and environmental adaptability of wheeled robots and vehicles. However, due to the classical manufacturing design, it is always challenging for tires to infer external information precisely. To this end, this paper introduces a bimodal sensing tire, which can simultaneously capture tactile and visual data. By leveraging the emerging visuotactile techniques, the proposed smart tire can realize various functions, including terrain recognition, ground crack detection, load sensing, and tire damage detection. Besides, we optimize the material and structure of the tire to ensure its outstanding elasticity, toughness, hardness, and transparency. In terms of algorithms, a transformer-based multimodal classification algorithm, a load detection method based on finite element analysis, and a contact segmentation algorithm have been developed. Furthermore, we construct an intelligent mobile platform to validate the system's effectiveness and develop visual and tactile datasets in complex terrains. The experimental results show a classification accuracy of 99.2% for our multimodal terrain sensing algorithm, a tire damage detection accuracy of 97%, a 98% success rate in object search, and the ability to withstand tire loading weights exceeding 35 kg. In addition, we open-source our algorithms, hardware, and datasets at https://sites.google.com/view/vtire.
中文: 本文提出了一种双模态传感轮胎,通过同步采集触觉与视觉数据实现地形识别、地面裂缝检测、载荷感知及轮胎损伤检测,其优化的材料与算法在实验中展现出卓越性能,相关资源已开源共享。
English: This paper presents a bimodal sensing tire that captures tactile and visual data to enable terrain recognition, crack detection, load sensing, and damage detection with high accuracy, while its optimized material and algorithms achieve robust performance validated through extensive experiments.

Authors:Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, Daoyuan Wu
Title: BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts
Abstract:
Mixture-of-Experts (MoE) have emerged as a powerful architecture for large language models (LLMs), enabling efficient scaling of model capacity while maintaining manageable computational costs. The key advantage lies in their ability to route different tokens to different ``expert'' networks within the model, enabling specialization and efficient handling of diverse input. However, the vulnerabilities of MoE-based LLMs still have barely been studied, and the potential for backdoor attacks in this context remains largely unexplored. This paper presents the first backdoor attack against MoE-based LLMs where the attackers poison ``dormant experts'' (i.e., underutilized experts) and activate them by optimizing routing triggers, thereby gaining control over the model's output. We first rigorously prove the existence of a few ``dominating experts'' in MoE models, whose outputs can determine the overall MoE's output. We also show that dormant experts can serve as dominating experts to manipulate model predictions. Accordingly, our attack, namely BadMoE, exploits the unique architecture of MoE models by 1) identifying dormant experts unrelated to the target task, 2) constructing a routing-aware loss to optimize the activation triggers of these experts, and 3) promoting dormant experts to dominating roles via poisoned training data. Extensive experiments show that BadMoE successfully enforces malicious prediction on attackers' target tasks while preserving overall model utility, making it a more potent and stealthy attack than existing methods.
中文:混合专家(MoE)模型面临如BadMoE的后门攻击风险,该攻击通过毒化休眠专家来控制输出同时保持模型功能,揭示了其架构中的重大安全隐患。
English: Mixture-of-Experts (MoE) models face vulnerabilities from backdoor attacks like BadMoE, which poison dormant experts to control outputs while maintaining model utility, highlighting a critical security risk in their architecture.

Authors:Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, Rui Zhang
Title: HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Abstract:
High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs; however, there is no comprehensive benchmark for evaluating the HRI understanding of VLMs. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-middle, shedding light on future research.
中文: HRScene作为高分辨率图像理解的综合基准,揭示了当前视觉语言模型仅能达到约50%的准确率,且在有效利用图像区域方面存在明显不足。
English: HRScene is introduced as a comprehensive benchmark for high-resolution image understanding, revealing that current vision language models achieve only around 50% accuracy and struggle with effectively utilizing image regions.

Authors:Yi-Xing Peng, Qize Yang, Yu-Ming Tang, Shenghao Fu, Kun-Yu Lin, Xihan Wei, Wei-Shi Zheng
Title: ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Abstract:
Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each accompanied by detailed annotations that meticulously label every limb movement. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions. Experimental results indicate that, while current large multimodal models perform commendably on various tasks, they often fall short in achieving fine-grained understanding. We attribute this limitation to the scarcity of meticulously annotated data, which is both costly and difficult to scale manually. To address this, we propose proxy tasks to enhance the model perception ability in both spatial and temporal dimensions. These proxy tasks are carefully crafted to be driven by data automatically generated from existing MLLMs, thereby reducing the reliance on costly manual labels. Experimental results show that the proposed proxy tasks significantly narrow the gap toward the performance achieved with manually annotated fine-grained data.
中文摘要:ActionArt视频数据集旨在推动细粒度人体动作理解研究,发现现有多模态模型因缺乏精细标注数据而难以实现细节感知,并提出通过自动生成的代理任务显著提升模型性能,有效减少对人工标注的依赖。
English Summary: The ActionArt dataset is introduced to advance fine-grained human action understanding in videos, revealing that current multimodal models struggle with detailed comprehension due to limited annotated data, and proposes automated proxy tasks that significantly improve performance without costly manual labeling.

Authors:Chen Chen, Daochang Liu, Mubarak Shah, Chang Xu
Title: Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Abstract:
Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
中文: 文本到图像扩散模型存在记忆训练数据的问题,引发隐私和原创性担忧,而新的PRSS方法通过提示重锚定和语义搜索提升了隐私保护和实用性,实现了更优的平衡。
English: Text-to-image diffusion models face challenges with memorizing training data, raising privacy and originality concerns, but the new PRSS method enhances both privacy and utility through prompt re-anchoring and semantic search, achieving a superior trade-off.

Authors:Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, WenQi Ren
Title: Dual Prompting Image Restoration with Diffusion Transformers
Abstract:
Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality with scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information of low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch utilizes a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, textual description alone cannot fully capture its rich visual characteristics. Therefore, a dual prompting module is designed to provide DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts as extra conditional control, alongside textual prompts to form dual prompts, greatly enhance the quality of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.
中文摘要:DPIR通过双分支结构,结合图像先验与全局-局部视觉提示来增强扩散变换器,在图像修复中实现了超越纯文本条件方法的卓越性能。
English Summary: DPIR introduces a dual-branch approach combining image priors with global-local visual prompts to enhance diffusion transformers, achieving superior image restoration beyond text-only conditioning methods.

Authors:Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
Title: Subject-driven Video Generation via Disentangled Identity and Motion
Abstract:
We propose to train a subject-driven customized video generation model through decoupling the subject-specific learning from temporal dynamics in zero-shot without additional tuning. A traditional method for video customization that is tuning-free often relies on large, annotated video datasets, which are computationally expensive and require extensive annotation. In contrast to the previous approach, we introduce the use of an image customization dataset directly on training video customization models, factorizing the video customization into two folds: (1) identity injection through image customization dataset and (2) temporal modeling preservation with a small set of unannotated videos through the image-to-video training method. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.
Chinese: 我们提出了一种零样本视频定制模型,通过图像数据集进行身份注入,并利用未标注视频保持时序建模,将主体特定学习与时序动态解耦,无需额外调优即可实现优异的主体一致性。
English: We introduce a zero-shot video customization model that separates subject-specific learning from temporal dynamics using an image dataset for identity injection and unannotated videos for temporal modeling, achieving superior subject consistency without additional tuning.

Authors:Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
Title: Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Abstract:
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
Chinese: Token-Shuffle方法通过沿通道合并局部标记并在处理后恢复空间布局,减少自回归模型中的图像标记数量,实现了高效的2048x2048分辨率图像合成,在基准测试和人工评估中均优于现有模型。
English: The Token-Shuffle method reduces image tokens in autoregressive models by merging local tokens along channels and restoring spatial arrangements after processing, enabling efficient 2048x2048 resolution image synthesis that outperforms existing models in benchmarks and human evaluations.
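
The two operations named in the abstract can be illustrated with a minimal sketch. It assumes a token grid stored as nested Python lists and a hypothetical window size `s`; the function names, the data layout, and the omission of any learned layers are assumptions for illustration, not the authors' implementation. Merging an s x s window along the channel dimension turns H x W tokens of width C into (H/s) x (W/s) tokens of width s*s*C, and the inverse restores the original layout.

```python
def token_shuffle(grid, s):
    """Merge each s x s window of tokens into one token by concatenating channels.

    grid is a list of rows; each token is a list of C channel values.
    (Illustrative layout only; not taken from the paper's code.)
    """
    height, width = len(grid), len(grid[0])
    assert height % s == 0 and width % s == 0, "grid must tile evenly"
    merged_grid = []
    for i in range(0, height, s):
        row = []
        for j in range(0, width, s):
            merged = []
            for di in range(s):
                for dj in range(s):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        merged_grid.append(row)
    return merged_grid


def token_unshuffle(merged_grid, s):
    """Inverse of token_shuffle: split channels back into an s x s window."""
    h, w = len(merged_grid), len(merged_grid[0])
    channels = len(merged_grid[0][0]) // (s * s)
    grid = [[None] * (w * s) for _ in range(h * s)]
    for i in range(h):
        for j in range(w):
            token = merged_grid[i][j]
            for k in range(s * s):
                di, dj = divmod(k, s)
                grid[i * s + di][j * s + dj] = token[k * channels:(k + 1) * channels]
    return grid
```

With `s = 2`, a 4 x 4 grid of 16 tokens becomes a 2 x 2 grid of 4 tokens, each four times as wide, which is the reduction in sequence length the abstract refers to.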

Authors:Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan
Title: Multilingual Performance Biases of Large Language Models in Education
Abstract:
Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
中文摘要:大型语言模型在非英语教育任务中表现参差不齐,常显著低于英语水平,因此建议部署前先验证其在目标语言中的有效性。
English Summary: Large language models perform variably across educational tasks in non-English languages, often showing significant drops from English performance, so practitioners should verify their effectiveness in the target language before deployment.

Authors:Hongshu Guo, Wenjie Qiu, Zeyuan Ma, Xinglin Zhang, Jun Zhang, Yue-Jiao Gong
Title: Advancing CMA-ES with Learning-Based Cooperative Coevolution for Scalable Optimization
Abstract:
Recent research in Cooperative Coevolution (CC) has achieved promising progress in solving large-scale global optimization problems. However, existing CC paradigms have a primary limitation in that they require deep expertise for selecting or designing effective variable decomposition strategies. Inspired by advancements in Meta-Black-Box Optimization, this paper introduces LCC, a pioneering learning-based cooperative coevolution framework that dynamically schedules decomposition strategies during optimization processes. The decomposition strategy selector is parameterized through a neural network, which processes a meticulously crafted set of optimization status features to determine the optimal strategy for each optimization step. The network is trained via the Proximal Policy Optimization method in a reinforcement learning manner across a collection of representative problems, aiming to maximize the expected optimization performance. Extensive experimental results demonstrate that LCC not only offers certain advantages over state-of-the-art baselines in terms of optimization effectiveness and resource consumption, but it also exhibits promising transferability towards unseen problems.
中文摘要:本文提出LCC这一基于学习的协同进化框架,通过强化学习训练的神经网络在优化过程中动态选择分解策略,实验证明其在优化效果和迁移性方面均优于现有先进方法。
English Summary: This paper introduces LCC, a learning-based cooperative coevolution framework that uses a neural network trained with reinforcement learning to dynamically select decomposition strategies during optimization, demonstrating superior performance and transferability over existing methods.

Authors:Ivor van der Hoog, Thijs van der Horst, Eva Rotenberg, Lasse Wulf
Title: Fréchet Distance in Unweighted Planar Graphs
Abstract:
The Fréchet distance is a distance measure between trajectories in $\Bbb{R}^d$ or walks in a graph $G$. Given constant-time shortest path queries, the Discrete Fréchet distance $D_G(P, Q)$ between two walks $P$ and $Q$ can be computed in $O(|P| \cdot |Q|)$ time using a dynamic program. Driemel, van der Hoog, and Rotenberg [SoCG'22] show that for weighted planar graphs this approach is likely tight, as there can be no strongly-subquadratic algorithm to compute a $1.01$-approximation of $D_G(P, Q)$ unless the Orthogonal Vector Hypothesis (OVH) fails. Such quadratic-time conditional lower bounds are common to many Fréchet distance variants. However, they can be circumvented by assuming that the input comes from some well-behaved class: There exist $(1+\varepsilon)$-approximations, both in weighted graphs and in $\Bbb{R}^d$, that take near-linear time for $c$-packed or $κ$-straight walks in the graph. In $\Bbb{R}^d$ there also exists a near-linear time algorithm to compute the Fréchet distance whenever all input edges are long compared to the distance. We consider computing the Fréchet distance in unweighted planar graphs. We show that there exists no strongly-subquadratic algorithm that computes a $1.25$-approximation of the discrete Fréchet distance between two disjoint simple paths in an unweighted planar graph, unless OVH fails. This improves the previous lower bound, both in terms of generality and approximation factor. We subsequently show that adding graph structure circumvents this lower bound: If the graph is a regular tiling with unit-weighted edges, then there exists an $\tilde{O}((|P| + |Q|)^{1.5})$-time algorithm to compute $D_G(P, Q)$. Our result has natural implications in the plane, as it allows us to define a new class of well-behaved curves that facilitate $(1+\varepsilon)$-approximations of their discrete Fréchet distance in subquadratic time.
中文: 除非正交向量假设不成立,否则在无权平面图中无法以强次二次时间计算离散弗雷歇距离的1.25倍近似解,但在规则铺砌等结构化图中存在高效算法。
English: The discrete Fréchet distance in unweighted planar graphs cannot be approximated within a factor of 1.25 in strongly subquadratic time unless the Orthogonal Vector Hypothesis fails, yet efficient algorithms exist for structured graphs like regular tilings.
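
The quadratic dynamic program that the abstract takes as its baseline can be sketched as follows. This is the textbook recurrence for the discrete Fréchet distance, not the paper's subquadratic algorithm. The function names and the adjacency-dict graph format are assumptions for illustration, and the BFS precomputation merely stands in for the constant-time shortest path oracle the abstract assumes (the graph is assumed connected).

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def discrete_frechet(adj, P, Q):
    """Discrete Frechet distance between walks P and Q, in O(|P| * |Q|) table cells."""
    # Precompute distances from every vertex on P; this replaces the
    # constant-time distance oracle assumed in the abstract.
    dist_from = {p: bfs_distances(adj, p) for p in set(P)}
    n, m = len(P), len(Q)
    table = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best_prev = 0
            elif i == 0:
                best_prev = table[0][j - 1]
            elif j == 0:
                best_prev = table[i - 1][0]
            else:
                best_prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            # The cost of a traversal is its largest pairwise distance.
            table[i][j] = max(dist_from[P[i]][Q[j]], best_prev)
    return table[n - 1][m - 1]
```

Each cell depends only on its three predecessors, which is why the table has |P| * |Q| entries and why the conditional lower bound in the abstract targets exactly this quadratic cost.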

Authors:Ivor van der Hoog, Eva Rotenberg, Daniel Rutschmann
Title: Simple Universally Optimal Dijkstra
Abstract:
Let G be a weighted (directed) graph with n vertices and m edges. Given a source vertex s, Dijkstra's algorithm computes the shortest path lengths from s to all other vertices in O(m + n log n) time. This bound is known to be worst-case optimal via a reduction to sorting. Theoretical computer science has developed numerous fine-grained frameworks for analyzing algorithmic performance beyond standard worst-case analysis, such as instance optimality and output sensitivity. Haeupler et al. [FOCS '24] consider the notion of universal optimality, a refined complexity measure that accounts for both the graph topology and the edge weights. For a fixed graph topology, the universal running time of a weighted graph algorithm is defined as its worst-case running time over all possible edge weightings of G. An algorithm is universally optimal if no other algorithm achieves a better asymptotic universal running time on any particular graph topology. They show that Dijkstra's algorithm can be made universally optimal by replacing the heap with a custom data structure. We revisit their result. We introduce a simple heap property called timestamp optimality, where the cost of popping an element x is logarithmic in the number of elements inserted between pushing and popping x. We show that timestamp optimal heaps are not only easier to define but also easier to implement. Using these timestamps, we provide a significantly simpler proof that Dijkstra's algorithm, with the right kind of heap, is universally optimal.
中文: 作者提出了一种简单的时间戳最优性堆属性,不仅简化了实现,还更简洁地证明了在采用合适堆结构时,Dijkstra算法能够达到通用最优性。
English: The authors introduce a simple timestamp optimality property for heaps, which simplifies both the implementation and the proof that Dijkstra's algorithm can achieve universal optimality when using an appropriate heap structure.
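
For reference, the baseline algorithm discussed in the abstract can be sketched with Python's standard `heapq` module. This is plain Dijkstra with a binary heap and lazy deletion, which runs in O((m + n) log n) rather than the O(m + n log n) bound quoted above, and it does not implement the timestamp-optimal heap from the paper. The function name and the adjacency-dict format are assumptions for illustration.

```python
import heapq


def dijkstra(adj, source):
    """Shortest path lengths from source.

    adj maps each vertex to a list of (neighbor, non-negative weight) pairs.
    Uses a binary heap with lazy deletion, not a timestamp-optimal heap.
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry for an already settled vertex
        settled.add(u)
        for v, w in adj.get(u, []):
            candidate = d + w
            if v not in dist or candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist
```

The heap is the only component the universal-optimality result changes: the relaxation loop stays the same, and only the cost model of push and pop differs.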

Authors:Rishav Pramanik, Antoine Poupon, Juan A. Rodriguez, Masih Aminbeidokhti, David Vazquez, Christopher Pal, Zhaozheng Yin, Marco Pedersoli
Title: Distilling semantically aware orders for autoregressive image generation
Abstract:
Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the dictation of the words makes sense for text generation, there is no inherent generation order that exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we show that first by training a model to generate patches in any-given-order, we can infer both the content and the location (order) of each patch during generation. Secondly, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.
中文: 本文提出了一种新的自回归图像生成方法,通过训练模型学习最优补丁生成顺序而非固定光栅扫描顺序,在无需额外训练成本或标注的情况下实现了更高质量的图像生成。
English: This paper introduces a novel autoregressive image generation method that learns optimal patch generation orders instead of relying on fixed raster-scan sequences, resulting in higher quality images without additional training costs or annotations.

Authors:Italo Santos, Katia Romero Felizardo, Bianca Trinkereinch, Daniel M. German, Igor Steinmacher, Marco A. Gerosa
Title: Exploring the Untapped: Student Perceptions and Participation in OSS
Abstract:
Open Source Software (OSS) projects offer valuable opportunities to train the next generation of software engineers while benefiting projects and society as a whole. While research has extensively explored student participation in OSS and its use in software engineering education, student participation in OSS is still low, and the perspectives of students who have never contributed remain underexplored. This study aims to investigate the relationship between students' interest in contributing to OSS and their perceptions of barriers and motivational factors. We developed a theoretical model to understand the relationship between students' perceptions of OSS and their interest in contributing. We then surveyed students majoring in computer science and related fields (N=241). Using structural equation modeling techniques, we tested the model and found that intrinsic and internalized extrinsic motivations are positively associated with interest in contributing to OSS projects, while the impact of extrinsic motivation varies by gender. Comparatively, we found no significant relationship between barriers and interest in contributing. Students suggested several ways to make projects more attractive, including increasing awareness of the importance of OSS. Our findings can help communities better prepare to integrate students and encourage educators to enhance interest in OSS by linking participation to specific motivational factors.
中文摘要:本研究探讨计算机专业学生对开源软件的贡献兴趣与动机及障碍认知的关系,发现内在动机显著促进参与意愿而障碍影响甚微,且外在动机的作用存在性别差异。
English Summary: This study explores how computer science students' motivations and perceived barriers influence their interest in contributing to open source software, revealing that intrinsic motivations significantly drive participation while barriers show minimal impact, with gender differences observed in extrinsic motivation effects.

Authors:Amber Xie, Oleh Rybkin, Dorsa Sadigh, Chelsea Finn
Title: Latent Diffusion Planning for Imitation Learning
Abstract:
Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amounts of expert demonstrations. To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, that both operate over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.
Chinese: 潜在扩散规划(LDP)是一种模块化模仿学习方法,通过在学习的潜在空间中结合规划器和逆动力学模型,有效利用次优和无动作数据,在视觉机器人任务中超越了现有方法。
English: Latent Diffusion Planning (LDP) is a modular imitation learning method that uses a planner and inverse dynamics model in a learned latent space to effectively leverage both suboptimal and action-free data, outperforming existing approaches on visual robotic tasks.

Authors:Ernestine Großmann, Ivor van der Hoog, Henrik Reinstädtler, Eva Rotenberg, Christian Schulz, Juliette Vlieghe
Title: From Theory to Practice: Engineering Approximation Algorithms for Dynamic Orientation
Abstract:
Dynamic graph algorithms have seen significant theoretical advancements, but practical evaluations often lag behind. This work bridges the gap between theory and practice by engineering and empirically evaluating recently developed approximation algorithms for dynamically maintaining graph orientations. We comprehensively describe the underlying data structures, including efficient bucketing techniques and round-robin updates. Our implementation has a natural parameter $λ$, which allows for a trade-off between algorithmic efficiency and the quality of the solution. In the extensive experimental evaluation, we demonstrate that our implementation offers a considerable speedup. Using different quality metrics, we show that our implementations are very competitive and can outperform previous methods. Overall, our approach solves more instances than other methods while being up to 112 times faster on instances that are solvable by all methods compared.
中文: 本研究通过工程实现和评估动态图定向算法,弥合了理论与实践之间的差距,利用可调参数展示了显著的速度提升和具有竞争力的求解质量。
English: This study bridges the gap between theory and practice by engineering and evaluating dynamic graph orientation algorithms, demonstrating significant speed improvements and competitive solution quality through a tunable parameter.

Authors:Hong Ting Tsang, Zihao Wang, Yangqiu Song
Title: Transformers for Complex Query Answering over Knowledge Hypergraphs
Abstract:
Complex Query Answering (CQA) has been extensively studied in recent years. In order to model data that is closer to real-world distribution, knowledge graphs with different modalities have been introduced. Triple KGs, as the classic KGs composed of entities and relations of arity 2, have limited representation of real-world facts. Real-world data is more sophisticated. While hyper-relational graphs have been introduced, there are limitations in representing relationships of varying arity that contain entities with equal contributions. To address this gap, we sampled new CQA datasets: JF17k-HCQA and M-FB15k-HCQA. Each dataset contains various query types that include logical operations such as projection, negation, conjunction, and disjunction. In order to answer knowledge hypergraph (KHG) existential first-order queries, we propose a two-stage transformer model, the Logical Knowledge Hypergraph Transformer (LKHGT), which consists of a Projection Encoder for atomic projection and a Logical Encoder for complex logical operations. Both encoders are equipped with Type Aware Bias (TAB) for capturing token interactions. Experimental results on CQA datasets show that LKHGT is a state-of-the-art CQA method over KHG and is able to generalize to out-of-distribution query types.
Chinese: 为解决现有知识图谱在表示复杂现实数据方面的局限性,本研究引入了新数据集并提出了逻辑知识超图变换器(LKHGT),该两阶段模型在复杂查询回答中实现了最先进性能,并能泛化至未见过的查询类型。
English: To address the limitations of existing knowledge graphs in representing complex real-world data, this study introduces new datasets and proposes the Logical Knowledge Hypergraph Transformer (LKHGT), a two-stage model that achieves state-of-the-art performance in complex query answering while generalizing to unseen query types.

Authors:Jiaping Tang, Jianan Mu, Silin Liu, Zizhen Liu, Feng Gu, Xinyu Zhang, Leyan Wang, Shenwen Liang, Jing Ye, Huawei Li, Xiaowei Li
Title: ERASER: Efficient RTL FAult Simulation Framework with Trimmed Execution Redundancy
Abstract:
As intelligent computing devices increasingly integrate into human life, ensuring the functional safety of the corresponding electronic chips becomes more critical. A key metric for functional safety is achieving a sufficient fault coverage. To meet this requirement, extensive time-consuming fault simulation of the RTL code is necessary during the chip design phase. The main overhead in RTL fault simulation comes from simulating behavioral nodes (always blocks). Due to the limited fault propagation capacity, fault simulation results often match the good simulation results for many behavioral nodes. A key strategy for accelerating RTL fault simulation is the identification and elimination of redundant simulations. Existing methods detect redundant executions by examining whether the fault inputs to each RTL node are consistent with the good inputs. However, we observe that this input comparison mechanism overlooks a significant amount of implicit redundant execution: although the fault inputs differ from the good inputs, the node's execution results remain unchanged. Our experiments reveal that this overlooked redundant execution constitutes nearly half of the total execution overhead of behavioral nodes, becoming a significant bottleneck in current RTL fault simulation. The underlying reason for this overlooked redundancy is that, in these cases, the true execution paths within the behavioral nodes are not affected by the changes in input values. In this work, we propose a behavior-level redundancy detection algorithm that focuses on the true execution paths. Building on the elimination of redundant executions, we further developed an efficient RTL fault simulation framework, Eraser. Experimental results show that compared to commercial tools, under the same fault coverage, our framework achieves a 3.9 $\times$ improvement in simulation performance on average.
中文: 针对RTL故障模拟中因冗余执行导致的效率低下问题,本研究提出一种行为级冗余检测算法和Eraser框架,通过关注真实执行路径显著提升了模拟性能。
English: To address the inefficiency in RTL fault simulation caused by redundant executions, this study introduces a behavior-level redundancy detection algorithm and the Eraser framework, which significantly improves simulation performance by focusing on true execution paths.
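The idea of implicit redundancy described in this abstract can be illustrated with a toy sketch. This is not the Eraser algorithm, only an assumption-laden illustration: the names `TracingInputs`, `mux_node`, and `is_redundant` are hypothetical, and a behavioral node is modeled as a plain Python function. The sketch records which input signals are read along the good execution path and treats a faulty execution as redundant when all of those signals are unchanged, even if other inputs differ.

```python
class TracingInputs:
    """Wraps a node's input values and records which signals are actually read."""

    def __init__(self, values):
        self._values = dict(values)
        self.read = set()

    def __getitem__(self, name):
        self.read.add(name)
        return self._values[name]


def mux_node(inp):
    # Toy stand-in for an always block: 'b' is only read when sel is 0.
    if inp["sel"]:
        return inp["a"] + 1
    return inp["b"]


def is_redundant(node, good_inputs, fault_inputs):
    """Return (redundant, good_output).

    The faulty execution is treated as redundant if every signal read along
    the good execution path has the same value under the fault. In that case
    the faulty run takes the same path and produces the same output.
    """
    traced = TracingInputs(good_inputs)
    good_out = node(traced)
    same_on_path = all(good_inputs[s] == fault_inputs[s] for s in traced.read)
    return same_on_path, good_out


good = {"sel": 1, "a": 4, "b": 7}
fault_b = {"sel": 1, "a": 4, "b": 0}  # fault reaches 'b', which is never read
fault_a = {"sel": 1, "a": 5, "b": 7}  # fault reaches 'a', which is read

print(is_redundant(mux_node, good, fault_b))  # (True, 5)
print(is_redundant(mux_node, good, fault_a))  # (False, 5)
```

A plain input comparison would re-simulate the node for `fault_b` because the inputs differ, whereas the path-based check skips it. The check is conservative: a changed signal on the path may still leave the output unchanged, and this sketch would not detect that.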

Authors:Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
Title: Describe Anything: Detailed Localized Image and Video Captioning
Abstract:
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
中文:描述任意模型(DAM)通过保留局部细节和全局上下文,在详细区域描述方面引入创新,并借助半监督数据管道在多个基准测试中取得了最先进的成果。
English: The Describe Anything Model (DAM) introduces innovations for detailed localized captioning by preserving local details and global context, achieving state-of-the-art results across multiple benchmarks through a semi-supervised data pipeline.

Authors:Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Title: Towards Understanding Camera Motions in Any Video
Abstract:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
中文摘要:CameraBench是一个用于评估和改进摄像机运动理解的大规模数据集和基准,包含专家标注的视频和运动基元分类法,揭示了现有模型的局限性,并通过微调实现了性能提升。
English Summary: CameraBench is a comprehensive dataset and benchmark for evaluating camera motion understanding, featuring expert-annotated videos and a taxonomy of motion primitives that reveals the limitations of current models and enables improved performance through fine-tuning.

Authors:Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, Yi Yang
Title: Insert Anything: Image Insertion via In-Context Editing in DiT
Abstract:
This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Instead of training separate models for individual tasks, our approach is trained once on our new AnyInsertion dataset--comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion--and effortlessly generalizes to a wide range of insertion scenarios. Such a challenging setting requires capturing both identity features and fine-grained details, while allowing versatile local adaptations in style, color, and texture. To this end, we propose to leverage the multimodal attention of the Diffusion Transformer (DiT) to support both mask- and text-guided editing. Furthermore, we introduce an in-context editing mechanism that treats the reference image as contextual information, employing two prompting strategies to harmonize the inserted elements with the target scene while faithfully preserving their distinctive features. Extensive experiments on AnyInsertion, DreamBooth, and VTON-HD benchmarks demonstrate that our method consistently outperforms existing alternatives, underscoring its great potential in real-world applications such as creative content generation, virtual try-on, and scene composition.
中文: 本研究提出Insert Anything框架,通过单一模型在多样化AnyInsertion数据集上训练,利用多模态注意力和上下文编辑机制,实现参考图像中对象在目标场景中的灵活插入,在多个基准测试中展现出卓越性能。
English: This study introduces Insert Anything, a unified framework that enables flexible, user-guided insertion of objects from reference images into target scenes using a single model trained on the diverse AnyInsertion dataset, leveraging multimodal attention and in-context editing to achieve superior performance across various benchmarks.

Authors:Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
Title: RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
Abstract:
Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective that, as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
中文:本文提出的RealisDance-DiT方法通过对强大视频基础模型进行最小化架构修改并采用高效微调策略,在应对多样化现实挑战时大幅优于现有方法,实现了更优的可控角色动画效果。
English: This paper introduces RealisDance-DiT, a method that achieves superior controllable character animation by minimally modifying a powerful video foundation model and implementing efficient fine-tuning strategies, significantly outperforming existing approaches across diverse real-world challenges.

Authors:Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen
Title: MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Abstract:
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
中文: 数据质量和多样性对构建有效的指令调优数据集至关重要,而提出的MIG方法通过标签图建模语义空间并最大化信息增益,在实验中始终优于现有方法。
English: Data quality and diversity are crucial for effective instruction-tuning datasets, and the proposed MIG method maximizes information gain by modeling the semantic space with a label graph, outperforming existing approaches in experiments.

Authors:Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, Xinlei He
Title: Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
Abstract:
Recent advancements in large reasoning models (LRMs) have demonstrated the effectiveness of scaling test-time computation to enhance reasoning capabilities on various tasks. However, LRMs often suffer from an "overthinking" problem, where the model generates excessively redundant reasoning steps with limited performance gains. In this work, we empirically reveal an important characteristic of LRM behaviors: placing external CoTs generated by smaller models between the thinking tokens (<think> and </think>) can effectively manipulate the model to generate fewer thoughts. Building on this finding, we propose a simple yet efficient pipeline that enables LRMs to bypass unnecessary intermediate steps, thereby significantly reducing computational costs. We conduct extensive experiments to evaluate the utility and efficiency of the pipeline. For instance, when applied to QwQ-32B on the LiveBench/Code dataset, the pipeline keeps the original performance while reducing output token counts by approximately 30%, with minimal overhead introduced by the CoT generator. Furthermore, we identify two suboptimal modes, blindly following flawed external thoughts and unnecessary rethinking, and show that simple mitigations, such as difficulty-aware fallbacks, can further improve performance. Overall, the pipeline offers a practical, general, and efficient way to optimize LRM inference, making powerful reasoning models more accessible and scalable for real-world applications.
中文:该方法通过引入较小模型生成的外部思维链,使大型推理模型能有效减少冗余计算步骤,在保持性能的同时显著降低约30%的输出标记量,且额外开销极小。
English: The proposed method enables large reasoning models to reduce redundant computation by incorporating external CoTs from smaller models, maintaining performance while cutting output tokens by approximately 30% with minimal overhead.

Authors:Aniket Roy, Shubhankar Borse, Shreya Kadambi, Debasmit Das, Shweta Mahajan, Risheek Garrepalli, Hyojin Park, Ankita Nayak, Rama Chellappa, Munawar Hayat, Fatih Porikli
Title: DuoLoRA : Cycle-consistent and Rank-disentangled Content-Style Personalization
Abstract:
We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertwined, not independent. To address this, we propose DuoLoRA, a content-style personalization framework featuring three key components: (i) rank-dimension mask learning, (ii) effective merging via layer priors, and (iii) Constyle loss, which leverages cycle-consistency in the merging process. First, we introduce ZipRank, which performs content-style merging within the rank dimension, offering adaptive rank flexibility and significantly reducing the number of learnable parameters. Additionally, we incorporate SDXL layer priors to apply implicit rank constraints informed by each layer's content-style bias and adaptive merger initialization, enhancing the integration of content and style. To further refine the merging process, we introduce Constyle loss, which leverages the cycle-consistency between content and style. Our experimental results demonstrate that DuoLoRA outperforms state-of-the-art content-style merging methods across multiple benchmarks.
Chinese: 我们提出DuoLoRA框架,通过秩维度掩码学习、层级先验和Constyle损失有效融合内容与风格个性化,突破现有方法将二者独立处理的局限,在多项基准测试中优于最先进技术。
English: We propose DuoLoRA, a novel framework that effectively merges content and style personalization through rank-dimension mask learning, layer priors, and Constyle loss, outperforming existing methods by addressing their intertwined nature rather than treating them as independent.

Authors:Guanyu Wang, Kailong Wang, Yihao Huang, Mingyi Zhou, Zhang Qing cnwatcher, Geguang Pu, Li Li
Title: Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints
Abstract:
The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.
中文: 提出的跨图像反个性化(CAP)框架通过利用图像间风格一致性和自适应损失平衡,相比现有方法能更有效地增强隐私保护以对抗个性化威胁。
English: The proposed Cross-image Anti-Personalization (CAP) framework enhances privacy protection by leveraging inter-image style consistency and adaptive loss balancing to counter personalization threats more effectively than existing methods.

Authors:Lynnette Hui Xian Ng, Kathleen M. Carley
Title: The Dual Personas of Social Media Bots
Abstract:
Social media bots are AI agents that participate in online conversations. Most studies focus on the general bot and the malicious nature of these agents. However, bots have many different personas, each specialized towards a specific behavioral or content trait. Neither are bots singularly bad, because they are used for both good and bad information dissemination. In this article, we introduce fifteen agent personas of social media bots. These personas have two main categories: Content-Based Bot Persona and Behavior-Based Bot Persona. We also form yardsticks of the good-bad duality of the bots, elaborating on metrics of good and bad bot agents. Our work puts forth a guideline to inform bot detection regulation, emphasizing that policies should focus on how these agents are employed, rather than collectively terming bot agents as bad.
中文: 本文介绍了社交媒体机器人的十五种角色,分为内容型和行为型两类,并提出了评估其双重性质的框架,旨在为检测政策提供指导,强调应关注机器人的使用方式而非一概视为恶意。
English: This article introduces fifteen personas of social media bots, categorized into content-based and behavior-based types, and proposes a framework to evaluate their dual nature for informing detection policies that focus on usage rather than labeling all bots as malicious.

Authors:Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan
Title: Cobra: Efficient Line Art COlorization with BRoAder References
Abstract:
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.
中文摘要:Cobra提出了一种高效的线稿上色方法,采用因果稀疏DiT架构处理200多张参考图像并保持低延迟,实现了工业级的精准度和速度要求。
English Summary: Cobra introduces an efficient line art colorization method using a Causal Sparse DiT architecture to handle over 200 reference images with low latency, achieving industrial-grade accuracy and speed.

Authors:Arun Verma, Xiaoqiang Lin, Zhongxiang Dai, Daniela Rus, Bryan Kian Hsiang Low
Title: Active Human Feedback Collection via Neural Contextual Dueling Bandits
Abstract:
Collecting human preference feedback is often expensive, leading recent works to develop principled algorithms to select them more efficiently. However, these works assume that the underlying reward function is linear, an assumption that does not hold in many real-life applications, such as online recommendation and LLM alignment. To address this limitation, we propose Neural-ADB, an algorithm based on the neural contextual dueling bandit framework that provides a principled and practical method for collecting human preference feedback when the underlying latent reward function is non-linear. We theoretically show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the preference dataset increases. Our experimental results on preference datasets further corroborate the effectiveness of Neural-ADB.
中文摘要:Neural-ADB是一种基于神经上下文对决赌博机框架的新算法,针对非线性潜在奖励函数提供了系统化的人类偏好反馈收集方法,理论分析和实验结果表明其在偏好数据集上具有显著有效性。
English Summary: Neural-ADB is a novel algorithm designed to efficiently collect human preference feedback for non-linear reward functions, offering theoretical guarantees and experimental validation for improved performance in applications like recommendation systems and LLM alignment.
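For reference, the Bradley-Terry-Luce preference model named in this abstract is commonly written as below. The symbols are assumptions for illustration and are not taken from the paper: $f$ stands for the latent (possibly non-linear) reward function, $x_1$ and $x_2$ for the two compared items, and $\sigma$ for the logistic function.

```latex
\Pr(x_1 \succ x_2)
  = \frac{\exp\big(f(x_1)\big)}{\exp\big(f(x_1)\big) + \exp\big(f(x_2)\big)}
  = \sigma\big(f(x_1) - f(x_2)\big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}.
```

Under this model only the reward difference $f(x_1) - f(x_2)$ affects the preference probability. A linear-reward assumption would set $f(x) = \theta^\top \phi(x)$ for some feature map $\phi$, which is the assumption the abstract says Neural-ADB relaxes.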

Authors:Minghua Liu, Mikaela Angelina Uy, Donglai Xiang, Hao Su, Sanja Fidler, Nicholas Sharp, Jun Gao
Title: PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond
Abstract:
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! https://research.nvidia.com/labs/toronto-ai/partfield-release/
中文:PartField是一种无需预定义模板的前馈方法,可学习分层部件化3D特征,相比现有方法精度提升高达20%且运行速度显著加快,同时支持协同分割等应用。
English: PartField is a feedforward method that learns hierarchical part-based 3D features without predefined templates, achieving up to 20% higher accuracy and significantly faster runtime than existing approaches while enabling applications like co-segmentation.

Authors:Jiahuan Long, Wen Yao, Tingsong Jiang, Chao Ma
Title: CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors
Abstract:
Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.
中文摘要:CDUPatch是一种新型跨模态对抗补丁攻击方法,通过利用颜色引发的热变化和跨模态统一优化,显著提升了在可见光-红外物体检测器上的实际攻击效果。
English Summary: CDUPatch is a novel cross-modal adversarial patch attack that enhances real-world effectiveness against visible-infrared object detectors by leveraging color-induced thermal variations and unified optimization across modalities.

Authors:Mauro Conti, Francesco Marchiori, Sebastiano Matarazzo, Marco Rubin
Title: PQ-CAN: A Framework for Simulating Post-Quantum Cryptography in Embedded Systems
Abstract:
The rapid development of quantum computers threatens traditional cryptographic schemes, prompting the need for Post-Quantum Cryptography (PQC). Although the NIST standardization process has accelerated the development of such algorithms, their application in resource-constrained environments such as embedded systems remains a challenge. Automotive systems relying on the Controller Area Network (CAN) bus for communication are particularly vulnerable due to their limited computational capabilities, high traffic, and need for real-time response. These constraints raise concerns about the feasibility of implementing PQC in automotive environments, where legacy hardware and bit rate limitations must also be considered. In this paper, we introduce PQ-CAN, a modular framework for simulating the performance and overhead of PQC algorithms in embedded systems. We consider the automotive domain as our case study, testing a variety of PQC schemes under different scenarios. Our simulation enables the adjustment of embedded system computational capabilities and CAN bus bit rate constraints. We also provide insights into the trade-offs involved by analyzing each algorithm's security level and overhead for key encapsulation and digital signature. By evaluating the performance of these algorithms, we provide insights into their feasibility and identify the strengths and limitations of PQC in securing automotive communications in the post-quantum era.
中文: 本文提出PQ-CAN模块化框架,用于在嵌入式汽车系统中模拟后量子密码算法,评估其在计算能力和CAN总线比特率受限条件下的性能表现与实施可行性。
English: This paper introduces PQ-CAN, a modular framework for simulating post-quantum cryptography algorithms in embedded automotive systems, evaluating their performance and feasibility under constrained computational and CAN bus bit rate conditions.

Authors:Francesco Marchiori, Denis Donadel, Mauro Conti
Title: Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors
Abstract:
Common Vulnerability and Exposure (CVE) records are fundamental to cybersecurity, offering unique identifiers for publicly known software and system vulnerabilities. Each CVE is typically assigned a Common Vulnerability Scoring System (CVSS) score to support risk prioritization and remediation. However, score inconsistencies often arise due to subjective interpretations of certain metrics. As the number of new CVEs continues to grow rapidly, automation is increasingly necessary to ensure timely and consistent scoring. While prior studies have explored automated methods, the application of Large Language Models (LLMs), despite their recent popularity, remains relatively underexplored. In this work, we evaluate the effectiveness of LLMs in generating CVSS scores for newly reported vulnerabilities. We investigate various prompt engineering strategies to enhance their accuracy and compare LLM-generated scores against those from embedding-based models, which use vector representations classified via supervised learning. Our results show that while LLMs demonstrate potential in automating CVSS evaluation, embedding-based methods outperform them in scoring more subjective components, particularly confidentiality, integrity, and availability impacts. These findings underscore the complexity of CVSS scoring and suggest that combining LLMs with embedding-based methods could yield more reliable results across all scoring components.
中文摘要:本研究评估了大语言模型在自动化网络安全漏洞CVSS评分中的有效性,发现尽管大语言模型展现出潜力,但基于嵌入的方法在主观评分组件上表现更优,表明结合两种方法可能产生更可靠的结果。
English Summary: This study evaluates the effectiveness of Large Language Models (LLMs) in automating CVSS scoring for cybersecurity vulnerabilities, finding that while LLMs show promise, embedding-based methods perform better on subjective components, suggesting a combined approach may yield more reliable results.
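Per-component comparison of CVSS vectors, as discussed in this abstract, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the helper names `parse_cvss_vector` and `component_agreement` are hypothetical, and the two example vectors are made up. It only assumes the standard CVSS v3.x vector string layout, a `CVSS:<version>` prefix followed by `/`-separated `metric:value` pairs.

```python
def parse_cvss_vector(vector):
    """Split a CVSS v3.x vector string into (version, {metric: value})."""
    prefix, *parts = vector.split("/")
    if not prefix.startswith("CVSS:"):
        raise ValueError("expected a 'CVSS:<version>' prefix")
    version = prefix[len("CVSS:"):]
    metrics = {}
    for part in parts:
        name, _, value = part.partition(":")
        if not name or not value:
            raise ValueError(f"malformed metric: {part!r}")
        metrics[name] = value
    return version, metrics


def component_agreement(predicted, reference):
    """Per-metric agreement between two parsed vectors (1 = match, 0 = mismatch)."""
    return {m: int(predicted.get(m) == v) for m, v in reference.items()}


# Made-up example: the prediction differs from the reference only on integrity (I).
version, ref = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
_, pred = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:H")
agreement = component_agreement(pred, ref)
print(version, agreement)
```

Scoring each metric separately, rather than only the final numeric score, makes it possible to see which components (for example confidentiality, integrity, and availability impact) a model gets wrong.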

Authors:Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng
Title: LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
Abstract:
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
中文: 本文提出了交互式行人重识别(Inter-ReID)任务,通过对话交互逐步完善不完整的目击者描述,并开发了LLaVA-ReID模型,该模型能基于视觉和文本上下文生成针对性问题以获取更多细节,实验表明其性能显著优于现有基线方法。
English: This paper introduces interactive person re-identification (Inter-ReID), a dialogue-based approach that refines incomplete witness descriptions through iterative questioning, and proposes LLaVA-ReID, a model that generates targeted questions to enhance retrieval accuracy, demonstrating superior performance over existing methods.

Authors:Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
Title: The Structural Safety Generalization Problem
Abstract:
LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.
中文: 本研究针对大语言模型越狱漏洞,聚焦安全机制在语义等效输入间的泛化失效问题,提出系统性攻击分析框架,并设计结构重写护栏,在有效拦截有害输入的同时避免过度拒绝良性交互。
English: This study addresses LLM jailbreak vulnerabilities by focusing on safety generalization failures across semantically equivalent inputs, proposing a framework for systematic attack analysis and introducing a Structure Rewriting Guardrail that enhances harmful input refusal without compromising benign interactions.

Authors:Jiahuan Long, Tingsong Jiang, Wen Yao, Shuai Jia, Weijia Zhang, Weien Zhou, Chao Ma, Xiaoqian Chen
Title: PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking
Abstract:
Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.
中文: 本文提出PapMOT方法,通过生成可打印的对抗性补丁,在数字和物理场景中攻击多目标跟踪系统的检测机制并干扰身份关联过程,实验证明该方法能有效破坏多种跟踪器的性能。
English: This paper introduces PapMOT, a method that generates physical adversarial patches to disrupt multiple object tracking in both digital and physical environments by attacking detection mechanisms and misleading identity associations, with evaluations confirming its effectiveness across various MOT architectures.

Authors:Haotian Ye, Himanshu Jain, Chong You, Ananda Theertha Suresh, Haowei Lin, James Zou, Felix Yu
Title: Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models
Abstract:
In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.
中文摘要:本文提出DISC与PPV算法,通过动态重要性采样实现无偏且高效的约束解码,解决了传统前缀树方法效率低和偏差问题。
English Summary: This paper introduces DISC with PPV, a novel algorithm that uses dynamic importance sampling to achieve unbiased and efficient constrained decoding, overcoming the inefficiency and bias of traditional prefix-tree methods.
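To make the bias mentioned in the abstract concrete, the following is a minimal toy sketch, not the DISC algorithm itself. It compares step-by-step token masking over a prefix tree with the exact model distribution conditioned on the constraint. The two-token vocabulary, the probabilities, the allowed set, and all function names are hypothetical values chosen for illustration.

```python
# Hypothetical toy model over the vocabulary {"a", "b"}:
# first-token probabilities and conditional second-token probabilities.
P_FIRST = {"a": 0.5, "b": 0.5}
P_SECOND = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}

# Constraint: only these complete sequences are valid outputs.
ALLOWED = {("a", "b"), ("b", "a"), ("b", "b")}


def model_prob(seq):
    """Unconstrained model probability of a two-token sequence."""
    first, second = seq
    return P_FIRST[first] * P_SECOND[first][second]


def exact_conditional():
    """Model distribution renormalized over the allowed sequences (the unbiased target)."""
    total = sum(model_prob(s) for s in ALLOWED)
    return {s: model_prob(s) / total for s in ALLOWED}


def prefix_masked():
    """Distribution induced by masking invalid tokens at each step and renormalizing locally.

    Also returns the product of the local normalizers for each sequence, which is the
    generic importance weight that maps the masked proposal back to the model probability.
    """
    probs, weights = {}, {}
    valid_first = {s[0] for s in ALLOWED}
    z1 = sum(P_FIRST[t] for t in valid_first)
    for first in valid_first:
        valid_second = {s[1] for s in ALLOWED if s[0] == first}
        z2 = sum(P_SECOND[first][t] for t in valid_second)
        for second in valid_second:
            probs[(first, second)] = (P_FIRST[first] / z1) * (P_SECOND[first][second] / z2)
            weights[(first, second)] = z1 * z2
    return probs, weights


masked, weights = prefix_masked()
exact = exact_conditional()
for seq in sorted(ALLOWED):
    print(seq, f"masked={masked[seq]:.3f}", f"exact={exact[seq]:.3f}", f"weight={weights[seq]:.2f}")
```

In this toy setting, local masking assigns probability 0.5 to the sequence ("a", "b"), while the exact conditional assigns it about 0.091. Reweighting each sample by the product of its local normalizers recovers the model probability up to a constant; this is standard importance sampling and is shown only to illustrate the kind of correction the abstract alludes to.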

Authors:Jiahuan Long, Tingsong Jiang, Wen Yao, Yizhe Xiong, Zhengqin Xu, Shuai Jia, Chao Ma
Title: Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models
Abstract:
Vision foundation models (VFMs) are large pre-trained models that form the backbone of various vision tasks. Fine-tuning VFMs can further unlock their potential for downstream tasks or scenarios. However, VFMs often contain significant feature redundancy, which may limit their adaptability to new tasks. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a parameter-free fine-tuning method to address this issue. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on model fine-tuning. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse the more relevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method. Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces computational and GPU memory overhead.
中文: 本文针对视觉基础模型提出一种无需参数的微调方法,通过特征选择与重用机制识别并替换冗余通道,在提升任务性能的同时显著降低计算开销。
English: This paper introduces a parameter-free fine-tuning method for vision foundation models like SAM, which identifies and replaces redundant channels through feature selection and reuse to enhance task-specific performance while reducing computational costs.

Authors:Jiahuan Long, Zhengqin Xu, Tingsong Jiang, Wen Yao, Shuai Jia, Chao Ma, Xiaoqian Chen
Title: Robust SAM: On the Adversarial Robustness of Vision Foundation Models
Abstract:
The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.
中文: 本文提出了一个针对SAM模型的对抗鲁棒性框架,通过跨提示攻击方法增强攻击迁移性,并采用基于奇异值分解的少参数自适应防御策略,在提升鲁棒性的同时最大限度地保持模型原有性能。
English: This paper introduces an adversarial robustness framework for the Segment Anything Model (SAM), featuring a cross-prompt attack method to improve transferability and a parameter-efficient defense strategy using singular value decomposition to balance robustness and accuracy.
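The idea of adapting only singular values can be written out as a short worked equation. This is a sketch under stated assumptions: the additive update on the singular values and the symbols below are illustrative, and the abstract does not specify the exact parameterization or which layers are adapted.

```latex
% Assumed notation: W is a pretrained m x n weight matrix, U and V are frozen.
W = U \,\mathrm{diag}(\sigma)\, V^{\top}, \qquad
W' = U \,\mathrm{diag}(\sigma + \Delta\sigma)\, V^{\top}, \qquad
\sigma,\ \Delta\sigma \in \mathbb{R}^{r}, \quad r = \min(m, n)
```

Under this assumption only the r entries of \Delta\sigma are trainable per adapted matrix, which is how a defense can stay within a budget of a few hundred parameters while leaving the singular vectors of the pretrained weights untouched.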

Authors:Cheng-Yu Hsieh, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Hadi Pouransari
Title: FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Abstract:
Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person (for example, their clothing) or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential need to prioritize different visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens emphasize the visual features of interest better than generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
中文摘要:FocalLens是一种条件视觉编码方法,能根据自然语言指令为同一图像生成不同的表征,相比通用编码器更能突出目标视觉特征,并在多项视觉任务中显著提升性能表现。
English Summary: FocalLens is a conditional visual encoding method that generates context-specific image representations based on natural language instructions, outperforming standard encoders by prioritizing relevant visual features and improving performance across multiple vision tasks.

Authors:Nirvan Patil, Malhar Abhay Inamdar, Agnivo Gosai, Guruprasad Pathak, Anish Joshi, Aryan Sagavekar, Anish Joshirao, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Title: Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Abstract:
Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for "inference-based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide a fundamental understanding of why Hindi models perform better than Marathi and Bengali models. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.
中文:本研究将TinyStories框架扩展至印地语、马拉地语和孟加拉语,证明采用语言特定分词器的小型语言模型能通过合成数据高效处理区域语言,其性能优于通用模型,并揭示了语言发展中的跨语言规律。
English: This study extends the TinyStories framework to Hindi, Marathi, and Bengali, demonstrating that small language models efficiently process regional languages, that language-specific tokenizers outperform general-purpose ones, and that synthetic data outperforms translated data for training, while revealing cross-linguistic patterns in language development.

Authors:Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
Title: Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Abstract:
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
中文: Geo4D利用视频扩散模型实现动态场景的单目三维重建,仅需合成数据训练即可零样本泛化至真实数据,并通过多模态对齐与滑动窗口推理显著超越现有最优方法。
English: Geo4D repurposes video diffusion models for monocular 3D reconstruction of dynamic scenes, leveraging synthetic data training with zero-shot generalization to real data, and surpasses state-of-the-art methods through multi-modal fusion and sliding window inference.

Authors:Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby
Title: Scaling Laws for Native Multimodal Models
Abstract:
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), that is, models trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
中文: 本研究通过大量实验挑战了后期融合多模态架构的优越性,证明早期融合模型能以更少参数实现更强性能、更高训练效率和更易部署,且结合专家混合机制可进一步提升模态特定学习效果。
English: This study challenges the superiority of late-fusion multimodal architectures by demonstrating through extensive experiments that early-fusion models achieve better performance with fewer parameters, greater training efficiency, and easier deployment, further enhanced by incorporating Mixture of Experts for modality-specific learning.

Authors:Shuying Gan, Xijun Wang, Chenyuan Feng, Chao Xu, Howard H. Yang, Xiang Chen, Tony Q. S. Quek
Title: Task-oriented Age of Information for Remote Inference with Hybrid Language Models
Abstract:
Large Language Models (LLMs) have revolutionized the field of artificial intelligence (AI) through their advanced reasoning capabilities, but their extensive parameter sets introduce significant inference latency, posing a challenge to ensure the timeliness of inference results. While Small Language Models (SLMs) offer faster inference speeds with fewer parameters, they often compromise accuracy on complex tasks. This study proposes a novel remote inference system comprising a user, a sensor, and an edge server that integrates both model types alongside a decision maker. The system dynamically determines the resolution of images transmitted by the sensor and routes inference tasks to either an SLM or LLM to optimize performance. The key objective is to minimize the Task-oriented Age of Information (TAoI) by jointly considering the accuracy and timeliness of the inference task. Due to the non-uniform transmission time and inference time, we formulate this problem as a Semi-Markov Decision Process (SMDP). By converting the SMDP to an equivalent Markov decision process, we prove that the optimal control policy follows a threshold-based structure. We further develop a relative policy iteration algorithm leveraging this threshold property. Simulation results demonstrate that our proposed optimal policy significantly outperforms baseline approaches in managing the accuracy-timeliness trade-off.
中文摘要:本研究提出一种远程推理系统,通过动态选择大语言模型或小语言模型并调整图像分辨率,以最小化任务导向信息年龄,仿真结果表明该系统在权衡准确性与及时性方面显著优于基准方法。
English Summary: This study introduces a remote inference system that dynamically selects between Large and Small Language Models and adjusts image resolution to minimize Task-oriented Age of Information, with simulations confirming its superior performance in balancing accuracy and timeliness compared to baseline methods.

Authors:Amirhossein Abaskohi, Amrutha Varshini Ramesh, Shailesh Nanisetty, Chirag Goel, David Vazquez, Christopher Pal, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
Title: AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Abstract:
We introduce AgentAda, the first LLM-powered analytics agent that can learn and use new analytics skills to extract more specialized insights. Unlike existing methods that require users to manually decide which data analytics method to apply, AgentAda automatically identifies the skill needed from a library of analytical skills to perform the analysis. This also allows AgentAda to use skills that existing LLMs cannot perform out of the box. The library covers a range of methods, including clustering, predictive modeling, and NLP techniques like BERT, which allow AgentAda to handle complex analytics tasks based on what the user needs. AgentAda's dataset-to-insight extraction strategy consists of three key steps: (I) a question generator to generate queries relevant to the user's goal and persona, (II) a hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose the best data analytics skill from the skill library, and (III) a code generator that produces executable code based on the retrieved skill's documentation to extract key patterns. We also introduce KaggleBench, a benchmark of curated notebooks across diverse domains, to evaluate AgentAda's performance. We conducted a human evaluation demonstrating that AgentAda provides more insightful analytics than existing tools, with 48.78% of evaluators preferring its analyses, compared to 27.67% for the unskilled agent. We also propose a novel LLM-as-a-judge approach that we show is aligned with human evaluation as a way to automate insight quality evaluation at larger scale.
中文摘要:AgentAda是首个基于大语言模型的分析智能体,能够自动从技能库中选择并运用专业分析方法来生成深度洞察,在人工评估中优于现有工具。
English Summary: AgentAda is the first LLM-powered analytics agent that autonomously selects and applies specialized analytical skills from a library to generate insights, outperforming existing tools in human evaluations.

Authors:William Andrew Simon, Irem Boybat, Riselda Kodra, Elena Ferro, Gagandeep Singh, Mohammed Alser, Shubham Jain, Hsinyu Tsai, Geoffrey W. Burr, Onur Mutlu, Abu Sebastian
Title: CiMBA: Accelerating Genome Sequencing through On-Device Basecalling via Compute-in-Memory
Abstract:
As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, which consumes more than 40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (~25 mm²) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog-focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24x that required for real-time operation, and achieves 17x/27x power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.
中文: 提出的内存计算碱基识别加速器(CiMBA)与AnaLog-Dorado神经网络通过实现实时设备端碱基识别,解决了基因组测序的瓶颈,在保持高精度的同时达到所需吞吐量的24倍,并显著提升了能效和面积效率。
English: The proposed Compute-in-Memory Basecalling Accelerator (CiMBA) with AnaLog-Dorado DNNs addresses genome sequencing bottlenecks by enabling real-time, on-device basecalling, achieving 24x required throughput and significant power/area efficiency improvements while maintaining high accuracy.
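The figures quoted in the abstract imply two derived quantities that are easy to check. The sketch below simply works through that arithmetic; the function names are hypothetical, decimal units (1 GB = 1000 MB) are an assumption, and the results are approximate because the reported 24x factor is itself rounded.

```python
def required_realtime_throughput(achieved_bases_per_s, speedup_over_realtime):
    """Throughput needed for real-time operation, implied by the reported speedup."""
    return achieved_bases_per_s / speedup_over_realtime


def data_rate_mb_per_s(gb_per_minute, mb_per_gb=1000):
    """Convert a GB-per-minute data rate to MB per second (decimal units assumed)."""
    return gb_per_minute * mb_per_gb / 60


realtime = required_realtime_throughput(4.77e6, 24)  # bases per second
raw_rate = data_rate_mb_per_s(0.5)                   # MB per second
print(f"implied real-time requirement: about {realtime:,.0f} bases/s")
print(f"raw data rate: about {raw_rate:.1f} MB/s")
```

So the reported throughput corresponds to a real-time requirement of roughly 200 thousand bases per second, and the stated upper bound on raw data generation is a little over 8 MB per second.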

Authors:Alexandra Ertl, Shuhan Xiao, Stefan Denner, Robin Peretzke, David Zimmerer, Peter Neher, Fabian Isensee, Klaus Maier-Hein
Title: nnLandmark: A Self-Configuring Method for 3D Medical Landmark Detection
Abstract:
Landmark detection plays a crucial role in medical imaging tasks that rely on precise spatial localization, including specific applications in diagnosis, treatment planning, image registration, and surgical navigation. However, manual annotation is labor-intensive and requires expert knowledge. While deep learning shows promise in automating this task, progress is hindered by limited public datasets, inconsistent benchmarks, and non-standardized baselines, restricting reproducibility, fair comparisons, and model generalizability. This work introduces nnLandmark, a self-configuring deep learning framework for 3D medical landmark detection, adapting nnU-Net to perform heatmap-based regression. By leveraging nnU-Net's automated configuration, nnLandmark eliminates the need for manual parameter tuning, offering out-of-the-box usability. It achieves state-of-the-art accuracy across two public datasets, with a mean radial error (MRE) of 1.5 mm on the Mandibular Molar Landmark (MML) dental CT dataset and 1.2 mm for anatomical fiducials on a brain MRI dataset (AFIDs), where nnLandmark aligns with the inter-rater variability of 1.5 mm. With its strong generalization, reproducibility, and ease of deployment, nnLandmark establishes a reliable baseline for 3D landmark detection, supporting research in anatomical localization and clinical workflows that depend on precise landmark identification. The code will be available soon.
中文: nnLandmark是一种自配置深度学习框架,无需手动调参即可在三维医学标志点检测中实现最先进的精度,具备强大的泛化能力和可重复性。
English: nnLandmark is a self-configuring deep learning framework that achieves state-of-the-art accuracy in 3D medical landmark detection, offering strong generalization and reproducibility without manual parameter tuning.
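Heatmap-based landmark regression and the mean radial error (MRE) metric can be illustrated with a small standard-library sketch. This is not the nnLandmark implementation: the Gaussian target, the argmax decoding, the grid size, the sigma value, and the function names are all assumptions chosen for illustration.

```python
import math
from itertools import product


def gaussian_heatmap(shape, center, sigma=1.5):
    """Return {voxel index: value} for a Gaussian blob centered on a landmark."""
    return {
        idx: math.exp(-sum((i - c) ** 2 for i, c in zip(idx, center)) / (2 * sigma ** 2))
        for idx in product(*(range(n) for n in shape))
    }


def decode_landmark(heatmap):
    """Predict the landmark as the voxel with the highest heatmap response."""
    return max(heatmap, key=heatmap.get)


def mean_radial_error(predicted, reference, spacing_mm):
    """Mean Euclidean distance in mm between predicted and reference voxel coordinates."""
    distances = [
        math.sqrt(sum(((p - r) * s) ** 2 for p, r, s in zip(pred, ref, spacing_mm)))
        for pred, ref in zip(predicted, reference)
    ]
    return sum(distances) / len(distances)


# A regression target for a hypothetical landmark at voxel (4, 5, 6) in an 8x8x8 volume.
target = gaussian_heatmap((8, 8, 8), (4, 5, 6))
print(decode_landmark(target))

# Two landmarks, one of them off by a single voxel along an axis with 0.5 mm spacing.
print(mean_radial_error([(4, 5, 6), (1, 1, 1)], [(4, 5, 7), (1, 1, 1)], (1.0, 1.0, 0.5)))
```

Accounting for voxel spacing matters here because MRE is reported in millimetres, so a one-voxel error contributes differently along anisotropic axes.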

Authors:Zhihua Xu, Tianshui Chen, Zhijing Yang, Siyuan Peng, Keze Wang, Liang Lin
Title: Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation
Abstract:
The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle, imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
中文摘要:音频驱动单样本说话头动画的主要挑战在于捕捉相邻视频帧间的细微变化,本文提出通过时序视听关联嵌入框架学习并整合视听时序关系,以增强特征表示并监督动画生成。
English Summary: The primary challenge in audio-driven one-shot talking head animation is capturing subtle frame changes, which is addressed by a novel Temporal Audio-Visual Correlation Embedding framework that learns and integrates audio-visual temporal relationships to enhance feature representation and supervise animation generation.

Authors:Jia Li, Jiacheng Shen, Yuxin Su, Michael R. Lyu
Title: LLM-assisted Mutation for Whitebox API Testing
Abstract:
Cloud applications heavily rely on APIs to communicate with each other and exchange data. To ensure the reliability of cloud applications, cloud providers widely adopt API testing techniques. Unfortunately, existing API testing approaches are insufficient to reach strict conditions, a problem known as fitness plateaus, due to the lack of gradient provided by coverage metrics. To address this issue, we propose MioHint, a novel white-box API testing approach that leverages the code comprehension capabilities of Large Language Models (LLMs) to boost API testing. The key challenge of LLM-based API testing lies in system-level testing, which emphasizes the dependencies between requests and targets across functions and files, thereby making the entire codebase the object of analysis. However, feeding the entire codebase to an LLM is impractical due to its limited context length and short memory. MioHint addresses this challenge by synergizing static analysis with LLMs. We retrieve relevant code with data-dependency analysis at the statement level, including def-use analysis for variables used in the target and function expansion for subfunctions called by the target. To evaluate the effectiveness of our method, we conducted experiments across 16 real-world REST API services. The findings reveal that MioHint achieves an average absolute increase of 4.95% in line coverage compared to the baseline, EvoMaster, alongside a 67x improvement in mutation accuracy. Furthermore, our method successfully covers over 57% of hard-to-cover targets, whereas the baseline covers fewer than 10%.
中文: MioHint是一种创新的白盒API测试方法,通过结合大型语言模型的代码理解能力和静态分析技术,有效解决了适应度平台问题,在真实API服务测试中显著提升了代码行覆盖率和变异准确率。
English: MioHint is a novel white-box API testing approach that utilizes Large Language Models and static analysis to overcome fitness plateaus, significantly improving line coverage and mutation accuracy in cloud applications compared to existing methods.
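The statement-level def-use retrieval described in the abstract can be sketched in miniature. The example below is an illustration only and is not MioHint's implementation: it analyzes Python source with the standard `ast` module, whereas the tool's actual target language and analysis are not stated here, and the function name and sample program are hypothetical. It also handles only direct, single-name assignments, not transitive dependencies.

```python
import ast


def defs_for_target(source, target_lineno):
    """Return assignment statements that define names read on the target line."""
    tree = ast.parse(source)
    # Names that are read (loaded) on the target line.
    used = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.lineno == target_lineno
    }
    lines = source.splitlines()
    relevant = []
    for node in ast.walk(tree):
        # Keep earlier assignments whose left-hand side is one of the used names.
        if isinstance(node, ast.Assign) and node.lineno < target_lineno:
            assigned = {t.id for t in node.targets if isinstance(t, ast.Name)}
            if assigned & used:
                relevant.append((node.lineno, lines[node.lineno - 1].strip()))
    return [text for _, text in sorted(relevant)]


SAMPLE = "a = read()\nb = 3\nc = a + 1\nif c > 10 and a:\n    hit()\n"

# Line 4 is the hard-to-cover branch condition; retrieve only the code that feeds it.
print(defs_for_target(SAMPLE, 4))
```

Selecting only the statements that feed the target condition keeps the retrieved context small, which is the motivation the abstract gives for combining static analysis with an LLM rather than passing it the whole codebase.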

Authors:Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, Cees G. M. Snoek
Title: SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Abstract:
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs perform better on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
中文: 本研究全面评估了视频自监督学习模型,揭示了它们在领域迁移、任务粒度等因素下泛化能力的不一致性,同时为未来研究建立了统一基准。
English: This study comprehensively evaluates video self-supervised learning models, revealing their inconsistent generalization across domain shifts, task granularity, and other factors while establishing a unified benchmark for future research.

Authors:Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin
Title: Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation
Abstract:
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose learning content and emotion priors as guidance, augmented with contrastive learning, to obtain decoupled content and emotion representations via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio features, which primarily contain content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn an emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train the CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.
Chinese: 本研究提出了一种对比解耦表示学习(CDRL)算法,通过对比学习分别获取内容和情感表示,从而在保持口型同步的同时实现精确的面部表情操控。
English: The study introduces a Contrastive Decoupled Representation Learning (CDRL) algorithm that learns separate content and emotion representations through contrastive learning, enabling precise facial expression manipulation while preserving speech synchronization.

Authors:Narine Kokhlikyan, Bargav Jayaraman, Florian Bordes, Chuan Guo, Kamalika Chaudhuri
Title: Measuring Déjà vu Memorization Efficiently
Abstract:
Recent research has shown that representation learning models may accidentally memorize their training data. For example, the déjà vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of the background, and to do so better than through dataset-level correlations. However, this measurement method requires training two models: one to estimate dataset-level correlations and the other to estimate memorization. This multiple-model setup becomes infeasible for large open-source models. In this work, we propose alternative simple methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model's memorization ability without any retraining. This enables, for the first time, the measurement of memorization in pre-trained open-source image representation and vision-language representation models. Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data. The code is available for both vision and vision-language models.
中文: 本研究提出了一种无需重新训练即可评估预训练开源模型记忆效应的简便方法,并发现这类模型通常比在数据子集上训练的模型具有更低的记忆程度。
English: This study introduces a simple method to estimate memorization in pre-trained open-source models without retraining, revealing that such models generally exhibit lower memorization than those trained on data subsets.

Authors:Qian-Wen Zhang, Fang Li, Jie Wang, Lingfeng Qiao, Yifei Yu, Di Yin, Xing Sun
Title: FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction
Abstract:
Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard-Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model's ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi-agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.
中文: 本研究提出一种多智能体协同数据增强方法,自动生成基于证据的问答对和不可回答问题,构建FactGuard-Bench数据集,旨在提升大语言模型处理可答与不可答问题的准确率,同时显著降低人工标注成本。
English: This study introduces a multi-agent collaborative data augmentation method to autonomously generate evidence-based question-answer pairs and unanswerable questions, creating the FactGuard-Bench dataset to enhance LLMs' accuracy in handling both answerable and unanswerable queries while reducing manual annotation costs.

Authors:Fan Nie, Lan Feng, Haotian Ye, Weixin Liang, Pan Lu, Huaxiu Yao, Alexandre Alahi, James Zou
Title: Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors
Abstract:
Efficiently leveraging the capabilities of contemporary large language models (LLMs) is increasingly challenging, particularly when direct fine-tuning is expensive and often impractical. Existing training-free methods, including manually or automatically designed workflows, typically demand substantial human effort or yield suboptimal results. This paper proposes Weak-for-Strong Harnessing (W4S), a novel framework that customizes smaller, cost-efficient language models to design and optimize workflows for harnessing stronger models. W4S formulates workflow design as a multi-turn Markov decision process and introduces reinforcement learning for agentic workflow optimization (RLAO) to train a weak meta-agent. Through iterative interaction with the environment, the meta-agent learns to design increasingly effective workflows without manual intervention. Empirical results demonstrate the superiority of W4S: our 7B meta-agent, trained with just one GPU hour, outperforms the strongest baseline by 2.9% ~ 24.6% across eleven benchmarks, successfully elevating the performance of state-of-the-art models such as GPT-3.5-Turbo and GPT-4o. Notably, W4S exhibits strong generalization capabilities across both seen and unseen tasks, offering an efficient, high-performing alternative to directly fine-tuning strong models.
中文: 本文提出弱模型驱动强模型(W4S)框架,通过强化学习训练小型元代理自动优化工作流程,显著提升GPT-3.5等大型语言模型的性能,在多项基准测试中超越现有方法且具备跨任务泛化能力。
English: This paper introduces the Weak-for-Strong Harnessing (W4S) framework, which trains a smaller, cost-efficient meta-agent using reinforcement learning to automatically design and optimize workflows for enhancing the performance of stronger large language models, achieving significant improvements across multiple benchmarks without manual intervention.

Authors:Yifei Yu, Qian-Wen Zhang, Lingfeng Qiao, Di Yin, Fang Li, Jie Wang, Zengxi Chen, Suncong Zheng, Xiaolong Liang, Xing Sun
Title: Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
Abstract:
Evaluating the ability of large language models (LLMs) to process lengthy contexts is critical, especially for retrieving query-relevant information embedded within them. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as \emph{needles}) from long contexts. The benchmark includes three needle generation pipelines (synthetic-temporal, real-temporal, and real-logical orders), with context lengths ranging from 8K to 128K, and comprises 14,000 samples (2,000 for testing). To facilitate the evaluation of this benchmark, we trained an evaluation model that assesses the correctness of LLM responses by comparing their completeness and sequential consistency against the ground truth, which provides a more reliable evaluation metric than GPT-4 or Claude. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.50% on the test set of this benchmark. Further analysis highlights the growing challenges posed by increasing the context length or the number of needles, underscoring substantial room for improvement of LLMs. Additionally, noise analysis validates the reliability and challenge of the benchmark, making Sequential-NIAH an important reference for advancing research on long text information extraction capabilities of LLMs.
中文: Sequential-NIAH是专门评估大语言模型从长文本中提取顺序信息能力的基准测试,通过实验发现现有模型最高准确率仅63.50%,为推进长文本信息提取研究提供了重要参考依据。
English: Sequential-NIAH is a benchmark designed to evaluate LLMs' ability to extract sequential information from long contexts, revealing significant performance gaps and providing a reliable assessment tool for advancing long-text information extraction research.

Authors:Nicolo Resmini, Eugenio Lomurno, Cristian Sbrolli, Matteo Matteucci
Title: Your Image Generator Is Your New Private Dataset
Abstract:
Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
中文摘要:本文提出的文本条件知识回收(TCKR)流程通过扩散模型生成合成训练数据,在实现与真实数据相当分类精度的同时,显著提升了隐私保护能力,有效降低了成员推理攻击的风险。
English Summary: The paper introduces the Text-Conditioned Knowledge Recycling (TCKR) pipeline that generates synthetic training data through diffusion models, achieving classification accuracy comparable to real data while significantly enhancing privacy protection by reducing vulnerability to membership inference attacks.
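The membership inference AUC cited in this abstract can be illustrated with a minimal sketch. The scores below are hypothetical and the function name is introduced here for illustration; it is not taken from the TCKR code. The AUC is the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member, so 0.5 means the attack does no better than chance.

```python
def membership_auc(member_scores, nonmember_scores):
    """AUC of a membership inference attack via pairwise comparison.

    Counts how often a member outscores a non-member; ties count as half.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (e.g. model confidence on each sample).
members = [0.9, 0.8, 0.7, 0.6]
nonmembers = [0.5, 0.4, 0.65, 0.3]
auc = membership_auc(members, nonmembers)
print(f"membership inference AUC: {auc:.4f}")
```

A lower AUC on the same attack indicates that members and non-members are harder to tell apart, which is the sense in which the abstract reports improved privacy.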

Authors:Yi Xu, Weicong Qin, Weijie Yu, Ming He, Jianping Fan, Jun Xu
Title: Decoding Recommendation Behaviors of In-Context Learning LLMs Through Gradient Descent
Abstract:
Recently, there has been a growing trend in utilizing large language models (LLMs) for recommender systems, referred to as LLMRec. A notable approach within this trend is not to fine-tune these models directly but instead to leverage In-Context Learning (ICL) methods tailored for LLMRec, denoted as LLM-ICL Rec. Many contemporary techniques focus on harnessing ICL content to enhance LLMRec performance. However, optimizing LLMRec with ICL content presents unresolved challenges. Specifically, two key issues stand out: (1) the limited understanding of why using a few demonstrations without model fine-tuning can lead to better performance compared to zero-shot recommendations; and (2) the lack of evaluation metrics for demonstrations in LLM-ICL Rec and the absence of theoretical analysis and practical design for optimizing the generation of ICL content for recommendation contexts. To address these two main issues, we propose a theoretical model, the LLM-ICL Recommendation Equivalent Gradient Descent model (LRGD), which connects recommendation generation with gradient descent dynamics. We demonstrate that the ICL inference process in an LLM aligns with the training procedure of its dual model, producing token predictions equivalent to the dual model's testing outputs. Building on these theoretical insights, we propose an evaluation metric for assessing demonstration quality. We integrate perturbations and regularizations in LRGD to enhance the robustness of the recommender system. To further improve demonstration effectiveness, prevent performance collapse, and ensure long-term adaptability, we also propose a two-stage optimization process in practice. Extensive experiments and detailed analysis on three Amazon datasets validate the theoretical equivalence and support the effectiveness of our theoretical analysis and practical module design.
中文: 近期研究探索无需微调而通过上下文学习将大语言模型应用于推荐系统,针对演示效果理解和评估指标缺失的挑战,提出了将推荐生成与梯度下降动态相联系的理论模型及实践优化方法,并通过实验验证了有效性。
English: Recent research explores using large language models for recommender systems through in-context learning without fine-tuning, addressing challenges in understanding demonstration effectiveness and evaluation metrics by proposing a theoretical model that connects recommendation generation with gradient descent dynamics and introducing practical optimization methods validated through experiments.

Authors:Xiao Lin, Zhichen Zeng, Tianxin Wei, Zhining Liu, Yuzhong Chen, Hanghang Tong
Title: CATS: Mitigating Correlation Shift for Multivariate Time Series Classification
Abstract:
Unsupervised Domain Adaptation (UDA) leverages labeled source data to train models for unlabeled target data. Given the prevalence of multivariate time series (MTS) data across various domains, the UDA task for MTS classification has emerged as a critical challenge. However, for MTS data, correlations between variables often vary across domains, whereas most existing UDA works for MTS classification have overlooked this essential characteristic. To bridge this gap, we introduce a novel domain shift, {\em correlation shift}, measuring domain differences in multivariate correlation. To mitigate correlation shift, we propose a scalable and parameter-efficient \underline{C}orrelation \underline{A}dapter for M\underline{TS} (CATS). Designed as a plug-and-play technique compatible with various Transformer variants, CATS employs temporal convolution to capture local temporal patterns and a graph attention module to model the changing multivariate correlation. The adapter reweights the target correlations to align the source correlations with a theoretically guaranteed precision. A correlation alignment loss is further proposed to mitigate correlation shift, bypassing the alignment challenge from the non-i.i.d. nature of MTS data. Extensive experiments on four real-world datasets demonstrate that (1) compared with vanilla Transformer-based models, CATS increases average accuracy by over $10\%$ while adding only around $1\%$ parameters, and (2) all Transformer variants equipped with CATS either reach or surpass state-of-the-art baselines.
中文: 本研究提出新的相关性偏移概念和CATS适配器,通过时序卷积和图注意力实现跨领域多元相关性对齐,以极少参数显著提升分类准确率。
English: This study introduces a novel correlation shift concept and proposes CATS, a plug-and-play adapter that aligns multivariate correlations across domains using temporal convolution and graph attention, achieving significant accuracy improvements with minimal parameter increase.
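To make the idea of domain differences in multivariate correlation concrete, here is a minimal sketch. It uses the Frobenius norm of the gap between Pearson correlation matrices as a stand-in measure. This is an assumption made for illustration; the abstract does not give the paper's formal definition of correlation shift, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(series):
    """series: list of variables, each a list of values over time."""
    d = len(series)
    return [[pearson(series[i], series[j]) for j in range(d)] for i in range(d)]


def correlation_gap(source, target):
    """Frobenius norm of the difference between the two correlation matrices."""
    cs, ct = correlation_matrix(source), correlation_matrix(target)
    d = len(cs)
    return math.sqrt(sum((cs[i][j] - ct[i][j]) ** 2
                         for i in range(d) for j in range(d)))


# Toy domains: the same two variables, positively vs. negatively correlated.
source = [[1, 2, 3, 4], [2, 4, 6, 8]]
target = [[1, 2, 3, 4], [8, 6, 4, 2]]
gap = correlation_gap(source, target)
print(f"correlation gap: {gap:.3f}")
```

In this toy case each variable has the same marginal values in both domains, yet the inter-variable correlation flips sign, which is the kind of difference a purely per-variable alignment would miss.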

Authors:Runlong Yu, Shengyu Chen, Yiqun Xie, Huaxiu Yao, Jared Willard, Xiaowei Jia
Title: Foundation Models for Environmental Science: A Survey of Emerging Frontiers
Abstract:
Modeling environmental ecosystems is essential for effective resource management, sustainable development, and understanding complex ecological processes. However, traditional data-driven methods face challenges in capturing inherently complex and interconnected processes and are further constrained by limited observational data in many environmental applications. Foundation models, which leverage large-scale pre-training and universal representations of complex and heterogeneous data, offer transformative opportunities for capturing spatiotemporal dynamics and dependencies in environmental processes, and facilitate adaptation to a broad range of applications. This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in common environmental use cases including forward prediction, data generation, data assimilation, downscaling, inverse modeling, model ensembling, and decision-making across domains. We also detail the process of developing these models, covering data collection, architecture design, training, tuning, and evaluation. Through discussions on these emerging methods as well as their future opportunities, we aim to promote interdisciplinary collaboration that accelerates advancements in machine learning for driving scientific discovery in addressing critical environmental challenges.
中文: 基础模型通过大规模数据整合与通用表征,克服了传统环境建模的数据局限,在预测、决策等应用中展现出变革潜力,推动环境科学的发展。
English: Foundation models offer transformative potential for environmental science by overcoming traditional data limitations and enabling advanced applications like prediction and decision-making through large-scale data integration and universal representations.

Authors:Xin Quan, Marco Valentino, Danilo S. Carvalho, Dhairya Dalal, André Freitas
Title: PEIRCE: Unifying Material and Formal Reasoning via LLM-Driven Neuro-Symbolic Refinement
Abstract:
A persistent challenge in AI is the effective integration of material and formal inference, the former concerning the plausibility and contextual relevance of arguments and the latter focusing on their logical and structural validity. Large Language Models (LLMs), by virtue of their extensive pre-training on large textual corpora, exhibit strong capabilities in material inference. However, their reasoning often lacks formal rigour and verifiability. At the same time, LLMs' linguistic competence positions them as a promising bridge between natural and formal languages, opening up new opportunities for combining these two modes of reasoning. In this paper, we introduce PEIRCE, a neuro-symbolic framework designed to unify material and formal inference through an iterative conjecture-criticism process. Within this framework, LLMs play the central role of generating candidate solutions in natural and formal languages, which are then evaluated and refined via interaction with external critique models. These critiques include symbolic provers, which assess formal validity, as well as soft evaluators that measure the quality of the generated arguments along linguistic and epistemic dimensions such as plausibility, coherence, and parsimony. While PEIRCE is a general-purpose framework, we demonstrate its capabilities in the domain of natural language explanation generation, a setting that inherently demands both material adequacy and formal correctness.
中文摘要:PEIRCE是一个神经符号框架,通过迭代的猜想-批判过程整合实质推理与形式推理,利用大语言模型生成候选方案并由外部批判模型评估,在需要语境相关性和逻辑严谨性的自然语言解释生成中展现了优势。
English Summary: PEIRCE is a neuro-symbolic framework that integrates material and formal inference through an iterative process where LLMs generate candidate solutions evaluated by external critique models, demonstrating its effectiveness in natural language explanation generation requiring both contextual relevance and logical validity.

Authors:Kepu Zhang, Weijie Yu, Zhongxiang Sun, Jun Xu
Title: An Explicit Syllogistic Legal Reasoning Framework for Large Language Models
Abstract:
Syllogistic reasoning is crucial for sound legal decision-making, allowing legal professionals to draw logical conclusions by applying general principles to specific case facts. While large language models (LLMs) can answer legal questions, they often struggle with explicit syllogistic reasoning. Their outputs tend to be implicit, unstructured, and consequently, less explainable and trustworthy. To overcome these limitations, we introduce SyLeR, a novel framework designed to enable LLMs to perform explicit syllogistic legal reasoning. SyLeR employs a tree-structured hierarchical retrieval mechanism to synthesize relevant legal statutes and precedents, thereby constructing comprehensive major premises. This is followed by a two-stage fine-tuning process: an initial supervised fine-tuning warm-up establishes a foundational understanding of syllogistic reasoning, while reinforcement learning, guided by a structure-aware reward mechanism, refines the model's capacity to generate diverse, logically sound, and well-structured reasoning paths. We conducted extensive experiments to evaluate SyLeR's performance. Our evaluations spanned diverse dimensions, including both in-domain and cross-domain user groups (legal laypersons and practitioners), multiple languages (Chinese and French), and various LLM backbones (legal-specific and open-domain LLMs). The results consistently demonstrate that SyLeR significantly enhances response accuracy and reliably produces explicit, explainable, and trustworthy legal reasoning.
中文摘要:SyLeR是一种新颖框架,通过分层检索和两阶段微调增强大语言模型进行明确法律三段论推理的能力,在多样化评估中显著提升回答准确性并生成可解释、可信的法律推理。
English Summary: SyLeR is a novel framework that enhances large language models' ability to perform explicit syllogistic legal reasoning through hierarchical retrieval and two-stage fine-tuning, significantly improving response accuracy and producing explainable, trustworthy legal reasoning across diverse evaluations.

Authors:Xinyi Wang, Shawn Tan, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
Title: Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect the resulting U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
Chinese: 本研究发现语言模型过度参数化会因过度记忆而损害隐性推理能力,并给出一条经验缩放规律,将知识图谱搜索熵线性映射到最优模型规模,为模型规模与推理能力的关系提供了新见解。
English: This study reveals that overparameterization in language models can hinder implicit reasoning due to excessive memorization, and it finds an empirical scaling law that linearly maps knowledge graph search entropy to the optimal model size, offering new insights into scaling effects on reasoning.

Authors:Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
Title: Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time
Abstract:
Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model sizes and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair the implicit reasoning performance due to excessive memorization. We investigate different factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that shows optimal-sized LMs can approximately reason over 0.008 bits of information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LLMs.
Chinese: 本研究发现语言模型过度参数化会因过度记忆而损害隐性推理能力,并通过经验缩放定律表明最优模型每参数约可处理0.008比特信息,为模型规模与推理能力的关系提供了新见解。
English: This study reveals that overparameterization in language models can hinder implicit reasoning due to excessive memorization, and it establishes an empirical scaling law showing optimal models can process about 0.008 bits per parameter, offering new insights into scaling effects on reasoning.
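As a worked reading of the 0.008 bits-per-parameter figure, assume $H$ denotes the amount of information (in bits) the model must reason over for a given knowledge graph and $N^{*}$ the optimal parameter count. Both symbols and the example value of $H$ are introduced here for illustration and do not come from the paper:

```latex
\[
  N^{*} \approx \frac{H}{0.008\ \text{bits/parameter}} = 125\,H,
  \qquad\text{e.g.}\quad
  H = 10^{6}\ \text{bits} \;\Longrightarrow\; N^{*} \approx 1.25 \times 10^{8}\ \text{parameters}.
\]
```

Under this reading, a model much larger than $N^{*}$ would sit on the side of the curve where, per the abstract, excessive memorization starts to impair implicit reasoning.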

Authors:Junshan Hu, Jialiang Mao, Zhikang Liu, Zhongpu Xia, Peng Jia, Xianpeng Lang
Title: TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference
Abstract:
Conventional Vision-Language Models (VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary computational overhead in simpler tasks, whereas insufficient tokens compromise fine-grained visual comprehension in more complex contexts. To overcome these limitations, we present TokenFLEX, an innovative and adaptable vision-language framework that encodes images into a variable number of tokens for efficient integration with a Large Language Model (LLM). Our approach is underpinned by two pivotal innovations. Firstly, we present a novel training paradigm that enhances performance across varying numbers of vision tokens by stochastically modulating token counts during training. Secondly, we design a lightweight vision token projector incorporating an adaptive pooling layer and SwiGLU, allowing for flexible downsampling of vision tokens and adaptive selection of features tailored to specific token counts. Comprehensive experiments reveal that TokenFLEX consistently outperforms its fixed-token counterparts, achieving performance gains of 1.6%, 1.0%, and 0.4% with 64, 144, and 256 tokens, respectively, averaged over eight vision-language benchmarks. These results underscore TokenFLEX's remarkable flexibility while maintaining high-performance vision-language understanding.
Chinese: TokenFLEX提出了一种自适应视觉语言框架,通过动态调整视觉标记数量来优化计算效率和视觉理解能力,在多个基准测试中均优于固定标记模型。
English: TokenFLEX introduces an adaptive vision-language framework that dynamically adjusts the number of vision tokens to optimize computational efficiency and visual comprehension, outperforming fixed-token models across multiple benchmarks.
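The two ingredients named in the abstract, adaptive pooling to a chosen token count and random selection of that count during training, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the TokenFLEX implementation. The 24x24 input grid, the one-dimensional features, and the function names are hypothetical, and the SwiGLU part of the projector is omitted.

```python
import random

# Side lengths giving the 64, 144, and 256 token budgets named in the abstract.
TOKEN_SIDES = [8, 12, 16]


def adaptive_avg_pool_2d(grid, out_size):
    """Average-pool an H x W grid of feature vectors to out_size x out_size tokens.

    Bin i spans rows floor(i*H/out) .. ceil((i+1)*H/out) - 1 (same rule for
    columns), so any input size maps to any output size.
    """
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for i in range(out_size):
        r0 = (i * h) // out_size
        r1 = -(-((i + 1) * h) // out_size)  # ceiling division
        for j in range(out_size):
            c0 = (j * w) // out_size
            c1 = -(-((j + 1) * w) // out_size)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append([sum(v[k] for v in cells) / len(cells)
                           for k in range(dim)])
    return pooled


def sample_training_tokens(grid, rng):
    """Pick a token budget at random for this step, then pool to it."""
    side = rng.choice(TOKEN_SIDES)
    return adaptive_avg_pool_2d(grid, side)


# Hypothetical 24x24 patch grid with 1-dimensional features.
grid = [[[float(r * 24 + c)] for c in range(24)] for r in range(24)]
tokens = sample_training_tokens(grid, random.Random(0))
print(len(tokens), "vision tokens for this training step")
```

At inference time the same pooling function can be called with a fixed side length, so a single trained model can serve several token budgets.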

Authors:Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta
Title: SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models
Abstract:
We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long-form synthetic creative documents spanning different genres; (2) unlearn short-form synthetic biographies containing personally identifiable information (PII), including fake names, phone numbers, SSNs, and email and home addresses; and (3) unlearn real documents sampled from the target model's training dataset. We received over 100 submissions from over 30 institutions, and we summarize the key techniques and lessons in this paper.
Chinese: SemEval-2025任务4专注于通过涉及合成与真实文档的三个子任务,使大语言模型遗忘敏感内容,并基于百余份提交成果总结了关键技术要点与经验教训。
English: SemEval-2025 Task 4 focuses on unlearning sensitive content from LLMs through three subtasks involving synthetic and real documents, with over 100 submissions analyzed for key techniques and insights.

Authors:Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
Title: Generative Evaluation of Complex Reasoning in Large Language Models
Abstract:
With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have surpassed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.
中文摘要:KUMO评估框架通过动态生成新颖推理任务,验证大语言模型是否真正具备推理能力而非简单记忆,研究表明先进模型已在复杂推理任务上达到大学生水平。
English Summary: The KUMO framework dynamically generates novel reasoning tasks to evaluate whether large language models genuinely reason or merely memorize, revealing that advanced LLMs now achieve university-level performance on complex reasoning challenges.

Authors:Sarita de Berg, Ivor van der Hoog, Eva Rotenberg, Daniel Rutschmann, Sampson Wong
Title: Instance-Optimal Imprecise Convex Hull
Abstract:
Imprecise measurements of a point set P = (p1, ..., pn) can be modelled by a family of regions F = (R1, ..., Rn), where each imprecise region Ri contains a unique point pi. A retrieval models an accurate measurement by replacing an imprecise region Ri with its corresponding point pi. We construct the convex hull of an imprecise point set in the plane, where regions in F may be retrieved at unit cost. The goal is to determine the cyclic ordering of the convex hull vertices of P as efficiently as possible. Here, efficiency is interpreted in two ways: (i) minimising the number of retrievals, and (ii) computing each retrieval location quickly. Prior works focused on only one of these two aspects: either minimising retrievals or optimising algorithmic runtime. Our contribution is the first to simultaneously achieve both. Let r(F, P) denote the minimal number of retrievals required by any algorithm to determine the convex hull of P for a given instance (F, P). For a family F of n constant-complexity polygons, our main result is a reconstruction algorithm that performs O(r(F, P)) retrievals in O(r(F, P) log^3 n) time. Compared to previous approaches that achieve optimal retrieval counts, we improve the runtime per retrieval by an exponential factor, from polynomial to polylogarithmic. Compared to near-linear time algorithms, we significantly reduce the number of retrievals used, and broaden the input families to include overlapping regions. We further extend our results to simple k-gons and to pairwise disjoint disks with radii in [1,k], where our runtime scales linearly with k.
中文: 本文提出一种高效算法,在确定不精确点集凸包时同时最小化检索次数并实现每次检索的多对数时间,在两方面均优于先前方法。
English: This paper presents an efficient algorithm that simultaneously minimizes the number of retrievals and achieves polylogarithmic time per retrieval for determining the convex hull of imprecise point sets, outperforming prior methods in both aspects.

Authors:Kepu Zhang, Guofu Xie, Weijie Yu, Mingyue Xu, Xu Tang, Yaxin Li, Jun Xu
Title: Legal Mathematical Reasoning with LLMs: Procedural Alignment through Two-Stage Reinforcement Learning
Abstract:
Legal mathematical reasoning is essential for applying large language models (LLMs) in high-stakes legal contexts, where outputs must be both mathematically accurate and procedurally compliant. However, existing legal LLMs lack structured numerical reasoning, and open-domain models, though capable of calculations, often overlook mandatory legal steps. To address this, we present LexNum, the first Chinese legal mathematical reasoning benchmark, covering three representative scenarios where each instance reflects legally grounded procedural flows. We further propose LexPam, a two-stage reinforcement learning framework for efficient legal reasoning training. Leveraging curriculum learning, we use a stronger teacher model to partition data into basic and challenging subsets. A lightweight 1.5B student model is then fine-tuned with Group Relative Policy Optimization, which avoids costly value networks and enables stable training from sparse, end-of-sequence rewards. The first stage improves accuracy and format; the second introduces a novel reward to guide procedural alignment via task-specific legal elements. Experiments show that existing models perform poorly on LexNum, while LexPam enhances both mathematical accuracy and legal coherence, and generalizes effectively across tasks and domains.
中文摘要:为解决法律大语言模型缺乏结构化数值推理的问题,我们提出了首个中文法律数学推理基准LexNum,并开发LexPam双阶段强化学习框架,通过课程学习和程序对齐奖励机制,显著提升模型的数学精确性与法律程序合规性。
English Summary: LexNum is introduced as the first Chinese legal mathematical reasoning benchmark to address the lack of structured numerical reasoning in legal LLMs, while LexPam, a two-stage reinforcement learning framework, enhances both mathematical accuracy and legal procedural compliance through curriculum learning and specialized rewards.
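
The abstract notes that Group Relative Policy Optimization avoids a value network. As a rough illustration of why, the sketch below computes group-relative advantages by normalizing each sampled response's reward against the other responses for the same prompt. The function name, the use of the population standard deviation, and the example rewards are assumptions for illustration, not details taken from LexPam.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    The group mean acts as the baseline, so no learned value network is needed.
    Assumption: population standard deviation; eps guards against a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical end-of-sequence rewards for four sampled answers to one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct answers get positive advantages, wrong ones negative
```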

Authors:Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth, Vivek Gupta
Title: Leveraging LLM For Synchronizing Information Across Multilingual Tables
Abstract:
The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data across architectures.
中文: 本文研究利用大语言模型通过零样本提示实现多语言维基百科表格同步,提出的任务分解策略在信息更新准确性和数据丰富度方面显著优于现有基线方法。
English: This paper investigates using large language models with zero-shot prompting to synchronize multilingual Wikipedia tables, introducing a task decomposition strategy that significantly improves update accuracy and data enrichment over existing methods.

Authors:Yudi Sang, Yanzhen Liu, Sutuke Yibulayimu, Yunning Wang, Benjamin D. Killeen, Mingxu Liu, Ping-Cheng Ku, Ole Johannsen, Karol Gotkowski, Maximilian Zenk, Klaus Maier-Hein, Fabian Isensee, Peiyan Yue, Yi Wang, Haidong Yu, Zhaohong Pan, Yutong He, Xiaokun Liang, Daiqi Liu, Fuxin Fan, Artur Jurgas, Andrzej Skalski, Yuxi Ma, Jing Yang, Szymon Płotka, Rafał Litka, Gang Zhu, Yingchun Song, Mathias Unberath, Mehran Armand, Dan Ruan, S. Kevin Zhou, Qiyong Cao, Chunpeng Zhao, Xinbao Wu, Yu Wang
Title: Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge
Abstract:
The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
中文: PENGWIN挑战赛通过基准测试推进了骨盆骨折自动分割技术,在CT图像中达到高精度(IoU 0.930),但在X射线图像中面临更大挑战(IoU 0.774),同时揭示了算法设计的多样性以及需要交互式方法提升临床适用性。
English: The PENGWIN challenge advanced automated pelvic fracture segmentation by benchmarking algorithms on CT and X-ray data, achieving high accuracy in CT (IoU 0.930) but facing greater challenges in X-ray (IoU 0.774), while revealing methodological diversity and the need for interactive approaches to improve clinical reliability.
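
To make the headline metric concrete, here is a minimal sketch of a fragment-wise intersection over union (IoU) averaged over fragments. It assumes predicted and ground-truth fragments share the same integer labels and that 0 is background; the challenge's actual fragment-matching procedure is not described in the abstract, and the function names and toy label maps are hypothetical.

```python
def fragment_iou(pred, truth, label):
    """IoU of a single fragment label between two equally shaped 2D label maps."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == label
            in_truth = t == label
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 1.0


def mean_fragment_iou(pred, truth, background=0):
    """Average IoU over every fragment label present in the ground truth."""
    labels = {v for row in truth for v in row if v != background}
    if not labels:
        raise ValueError("ground truth contains no fragments")
    return sum(fragment_iou(pred, truth, lab) for lab in labels) / len(labels)


# Toy 2x3 label maps with two fragments (labels 1 and 2).
truth = [[1, 1, 0],
         [2, 2, 0]]
pred = [[1, 0, 0],
        [2, 2, 2]]
print(mean_fragment_iou(pred, truth))  # fragment 1: 1/2, fragment 2: 2/3
```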

Authors:Nan Zhang, Yusen Zhang, Prasenjit Mitra, Rui Zhang
Title: When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks
Abstract:
Recent open-source large reasoning models (LRMs) exhibit strong performance on complex reasoning tasks, but their large parameter count makes them prohibitively expensive for individuals. The compression of large language models (LLMs) offers an effective solution to reduce the cost of computational resources. However, systematic studies on the performance of compressed LLMs in complex reasoning tasks, especially for LRMs, are lacking. Most works on quantization and pruning focus on preserving language modeling performance, while existing distillation works do not comprehensively benchmark student models based on reasoning difficulty or compression impact on knowledge and reasoning. In this paper, we benchmark compressed DeepSeek-R1 models on four different reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue), ranging from mathematical to multihop reasoning, using quantization, distillation, and pruning methods. We benchmark 2.51-, 1.73-, and 1.58-bit R1 models that adopt dynamic quantization. We also benchmark distilled R1 models that are based on LLaMA or Qwen and run SparseGPT on them to obtain various sparsity levels. Studying the performance and behavior of compressed LRMs, we report their performance scores and test-time compute (number of tokens spent on each question). Notably, using MuSiQue, we find that parameter count has a much greater impact on LRMs' knowledge memorization than on their reasoning capability, which can inform the choice of compression techniques. Through our empirical analysis of test-time compute, we find that shorter model outputs generally achieve better performance than longer ones across several benchmarks for both R1 and its compressed variants, highlighting the need for more concise reasoning chains.
中文: 本研究在四个推理数据集上对经量化、蒸馏和剪枝压缩的DeepSeek-R1模型进行基准测试,发现参数量对知识记忆的影响远大于对推理能力的影响,且较短的模型输出通常比较长的输出表现更好。
English: This study benchmarks quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets, finding that parameter count affects knowledge memorization far more than reasoning capability and that shorter outputs generally outperform longer ones.

Authors:Nan Zhang, Eugene Kwek, Yusen Zhang, Ngoc-Hieu Nguyen, Prasenjit Mitra, Rui Zhang
Title: When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
Abstract:
Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both Llama and Qwen: (1) Weight count has a greater impact on LRMs' knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.
中文: 本研究通过性能基准测试和机制分析,揭示了动态量化可使模型性能接近原始水平,并识别出关键权重,保护这些权重能显著提升模型准确率。
English: This study investigates how compression methods affect large reasoning models' performance through benchmarking and mechanistic analysis, revealing that dynamic quantization achieves near-original performance and identifying critical weights whose protection significantly enhances accuracy.

Authors:Abhilash Shankarampeta, Harsh Mahajan, Tushar Kataria, Dan Roth, Vivek Gupta
Title: TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
Abstract:
Humans continuously make new discoveries, and understanding the temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.
Chinese: 本研究提出了TRANSIENTTABLES数据集,用于评估受限于静态训练数据的大语言模型在时序推理方面的能力,并引入基于任务分解的新建模策略以提升其表现。
English: This study introduces the TRANSIENTTABLES dataset to evaluate the temporal reasoning abilities of large language models (LLMs), which are limited by static training data, and proposes task decomposition strategies to improve their performance.

Authors:Meng Yuan, Yutian Xiao, Wei Chen, Chu Zhao, Deqing Wang, Fuzhen Zhuang
Title: Hyperbolic Diffusion Recommender Model
Abstract:
Diffusion models (DMs) have emerged as the new state-of-the-art family of deep generative models. To gain deeper insights into the limitations of diffusion models in recommender systems, we investigate the fundamental structural disparities between images and items. Consequently, items often exhibit distinct anisotropic and directional structures that are less prevalent in images. However, the traditional forward diffusion process continuously adds isotropic Gaussian noise, causing anisotropic signals to degrade into noise, which impairs the semantically meaningful representations in recommender systems. Inspired by the advancements in hyperbolic spaces, we propose a novel \textit{\textbf{H}yperbolic} \textit{\textbf{D}iffusion} \textit{\textbf{R}ecommender} \textit{\textbf{M}odel} (named HDRM). Unlike existing directional diffusion methods based on Euclidean space, the intrinsic non-Euclidean structure of hyperbolic space makes it particularly well-adapted for handling anisotropic diffusion processes. In particular, we begin by formulating concepts to characterize latent directed diffusion processes within a geometrically grounded hyperbolic space. Subsequently, we propose a novel hyperbolic latent diffusion process specifically tailored for users and items. Drawing upon the natural geometric attributes of hyperbolic spaces, we impose structural restrictions on the space to enhance hyperbolic diffusion propagation, thereby ensuring the preservation of the intrinsic topology of user-item graphs. Extensive experiments on three benchmark datasets demonstrate the effectiveness of HDRM.
中文摘要:本研究提出HDRM双曲扩散模型,通过利用双曲空间的几何特性克服传统各向同性扩散在推荐系统中的局限,有效保持物品的各向异性结构并增强用户-物品图的拓扑关系。
English Summary: The study introduces HDRM, a hyperbolic diffusion model that overcomes the limitations of traditional isotropic diffusion in recommender systems by leveraging hyperbolic space's geometric properties to preserve anisotropic item structures and enhance user-item graph topology.

Authors:Ya-Ting Yang, Yunian Pan, Quanyan Zhu
Title: Preference-Centric Route Recommendation: Equilibrium, Learning, and Provable Efficiency
Abstract:
Traditional approaches to modeling and predicting traffic behavior often rely on Wardrop Equilibrium (WE), assuming non-atomic traffic demand and neglecting correlations in individual decisions. However, the growing role of real-time human feedback and adaptive recommendation systems calls for more expressive equilibrium concepts that better capture user preferences and the stochastic nature of routing behavior. In this paper, we introduce a preference-centric route recommendation framework grounded in the concept of Borda Coarse Correlated Equilibrium (BCCE), wherein users have no incentive to deviate from recommended strategies when evaluated by Borda scores (pairwise comparisons encoding user preferences). We develop an adaptive algorithm that learns from dueling feedback and show that it achieves $\mathcal{O}(T^{\frac{2}{3}})$ regret, implying convergence to the BCCE under mild assumptions. We conduct empirical evaluations using a case study to illustrate and justify our theoretical analysis. The results demonstrate the efficacy and practical relevance of our approach.
中文: 本文提出了一种基于Borda粗相关均衡的偏好导向路径推荐框架,该框架能有效捕捉用户偏好和随机路由行为,并实现了理论收敛和实证有效性。
English: This paper introduces a preference-centric route recommendation framework based on Borda Coarse Correlated Equilibrium that captures user preferences and stochastic routing behavior, achieving theoretical convergence and empirical efficacy.
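
For readers unfamiliar with Borda scores built from pairwise comparisons, the sketch below uses the definition common in the dueling-feedback literature: an option's score is its average probability of being preferred over each other option. Whether the paper uses exactly this normalization is an assumption, and the preference matrix for three routes is made up for illustration.

```python
def borda_scores(pref):
    """Borda score per option from a pairwise preference matrix.

    pref[i][j] is the probability that option i is preferred over option j.
    Each score averages over the other k - 1 options (diagonal is ignored).
    """
    k = len(pref)
    if k < 2:
        raise ValueError("need at least two options to compare")
    return [
        sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]


# Hypothetical pairwise preferences among three candidate routes.
pref = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
scores = borda_scores(pref)
best_route = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_route)
```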

Authors:José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
Title: Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
Abstract:
As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
中文:零样本基准测试(ZSB)框架通过利用语言模型自身生成合成测试数据并进行评估,解决了自动评估语言模型的难题,在多种任务和语言中均表现出与人类评估高度一致的效果。
English: The Zero-shot Benchmarking (ZSB) framework addresses the challenge of automatic language model evaluation by using language models themselves to generate synthetic test data and perform assessments, proving effective across multiple tasks and languages with strong correlation to human rankings.

Authors:Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
Title: ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
Abstract:
Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their ability to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency. Our model is built upon Qwen-2.5-7B and trained on 500K papers from arXiv. It achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality -- measured across relevance, coherence, academic rigor, completeness, and innovation -- significantly surpassing all existing models, including much larger ones like the Retrieval-Augmented Qwen2.5-72B-Instruct. Human studies further demonstrate that ScholarCopilot, despite being a 7B model, significantly outperforms ChatGPT, achieving 100% preference in citation quality and over 70% in overall usefulness.
中文:ScholarCopilot通过动态检索参考文献并联合优化生成与引用任务,显著提升了大型语言模型在专业学术写作中的表现,在检索准确率和整体质量上均优于更大规模的模型。
English: ScholarCopilot enhances large language models to generate professional academic articles with accurate citations by dynamically retrieving references and jointly optimizing generation and citation tasks, outperforming larger models in both retrieval accuracy and overall quality.

Authors:Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, Erwin Kessels
Title: LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models
Abstract:
Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science--specifically atomic layer deposition--schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.
中文: 本文介绍了schema-miner工具,它结合大语言模型与人工反馈,能从非结构化文本中自动提取并优化结构化模式,在原子层沉积等材料科学应用中展现出生成语义丰富模式的实用性。
English: This paper presents schema-miner, a tool that leverages large language models and human feedback to automatically extract and refine structured schemas from unstructured text, demonstrating its effectiveness in materials science applications like atomic layer deposition.

Authors:Ruben Weijers, Denton Wu, Hannah Betts, Tamara Jacod, Yuxiang Guan, Vidya Sujaya, Kushal Dev, Toshali Goel, William Delooze, Reihaneh Rabbany, Ying Wu, Jean-François Godbout, Kellin Pelrine
Title: From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions
Abstract:
Generative AI has the potential to transform personalization and accessibility of education. However, it raises serious concerns about accuracy and helping students become independent critical thinkers. In this study, we designed a helpful AI "Peer" to help students correct fundamental physics misconceptions related to Newtonian mechanics concepts. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher (with over 20 percentage points higher normalized gain) than a control group that discussed physics history. Qualitative feedback indicated that 91% of the treatment group's AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts' annotations of the AI interactions, we find initial evidence suggesting the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
中文: 本研究设计了一个明确告知存在40%错误率的AI"同伴",尽管不完美,却使学生的物理成绩提高了10.5个百分点,表明学习成效可能不依赖于AI回答的正确性。
English: This study introduces an intentionally imperfect AI "Peer" that improved students' physics scores by 10.5 percentage points despite disclosing its 40% error rate, suggesting learning gains may not depend on AI correctness.
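
The abstract reports results in both raw percentage points and normalized gain. As a small sketch, the code below uses the definition of normalized gain common in physics education research: the fraction of the possible improvement that was actually achieved. That the study computes it exactly this way is an assumption, and the scores in the example are invented.

```python
def normalized_gain(pre, post, max_score=100.0):
    """Fraction of the available headroom gained between pre-test and post-test."""
    if pre >= max_score:
        raise ValueError("pre-test score leaves no room for improvement")
    return (post - pre) / (max_score - pre)


# Hypothetical student: 40% on the pre-test, 58% on the post-test.
print(normalized_gain(40.0, 58.0))  # 18 points gained out of 60 available
```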

Authors:Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu
Title: CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation
Abstract:
Advancements in large language models (LLMs) have enabled the development of intelligent educational tools that support inquiry-based learning across technical domains. In cybersecurity education, where accuracy and safety are paramount, systems must go beyond surface-level relevance to provide information that is both trustworthy and domain-appropriate. To address this challenge, we introduce CyberBOT, a question-answering chatbot that leverages a retrieval-augmented generation (RAG) pipeline to incorporate contextual information from course-specific materials and validate responses using a domain-specific cybersecurity ontology. The ontology serves as a structured reasoning layer that constrains and verifies LLM-generated answers, reducing the risk of misleading or unsafe guidance. CyberBOT has been deployed in a large graduate-level course at Arizona State University (ASU), where more than one hundred students actively engage with the system through a dedicated web-based platform. Computational evaluations in lab environments highlight the potential capacity of CyberBOT, and a forthcoming field study will evaluate its pedagogical impact. By integrating structured domain reasoning with modern generative capabilities, CyberBOT illustrates a promising direction for developing reliable and curriculum-aligned AI applications in specialized educational contexts.
Chinese: CyberBOT是一款网络安全教育聊天机器人,它通过检索增强生成流程结合特定领域本体论来确保回答的准确性和安全性,目前已在亚利桑那州立大学研究生课程中部署评估。
English: CyberBOT is a cybersecurity education chatbot that uses a retrieval-augmented generation pipeline with a domain-specific ontology to ensure accurate and safe responses, currently deployed in a graduate course at ASU for evaluation.

Authors:Ujun Jeong, Lynnette Hui Xian Ng, Kathleen M. Carley, Huan Liu
Title: Navigating Decentralized Online Social Networks: An Overview of Technical and Societal Challenges in Architectural Choices
Abstract:
Decentralized online social networks have evolved from experimental stages to operating at unprecedented scale, with broader adoption and more active use than ever before. Platforms like Mastodon, Bluesky, Hive, and Nostr have seen notable growth, particularly following the wave of user migration after Twitter's acquisition in October 2022. As new platforms build upon earlier decentralization architectures and explore novel configurations, it becomes increasingly important to understand how these foundations shape both the direction and limitations of decentralization. Prior literature primarily focuses on specific architectures, resulting in fragmented views that overlook how different social networks encounter similar challenges and complement one another. This paper fills that gap by presenting a comprehensive view of the current decentralized online social network landscape. We examine four major architectures: federated, peer-to-peer, blockchain, and hybrid, tracing their evolution and evaluating how they support core social networking functions. By linking these architectural aspects to real-world cases, our work provides a foundation for understanding the societal implications of decentralized social platforms.
Chinese: 本文全面审视了去中心化在线社交网络的现状,通过分析联邦式、点对点、区块链和混合四种主要架构,探讨了它们的演变过程、功能支持能力及其社会影响。
English: This paper offers a comprehensive analysis of the decentralized online social network landscape, examining four major architectures—federated, peer-to-peer, blockchain, and hybrid—to understand their evolution, functional support, and societal implications.

Authors:Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
Title: JudgeLRM: Large Reasoning Models as a Judge
Abstract:
The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
中文:采用强化学习训练的大语言模型JudgeLRM在复杂评估任务中超越了传统微调和先进推理模型,即使模型规模更小也展现出卓越性能,尤其在需要深度推理的评判任务中表现突出。
English: Large Language Models trained with reinforcement learning, such as JudgeLRM, outperform traditional fine-tuning and advanced reasoning models in complex evaluation tasks, demonstrating superior performance even with smaller model sizes.

Authors:Jiamin Chang, Haoyang Li, Hammond Pearce, Ruoxi Sun, Bo Li, Minhui Xue
Title: What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
Abstract:
The growing adoption of artificial intelligence (AI) has amplified concerns about trustworthiness, including integrity, privacy, robustness, and bias. To assess and attribute these threats, we propose ConceptLens, a generic framework that leverages pre-trained multimodal models to identify the root causes of integrity threats by analyzing Concept Shift in probing samples. ConceptLens demonstrates strong detection performance for vanilla data poisoning attacks and uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It identifies privacy risks in unaltered but high-risk samples, filters them before training, and provides insights into model weaknesses arising from incomplete or imbalanced training data. Additionally, at the model level, it attributes concepts that the target model is overly dependent on, identifies misleading concepts, and explains how disrupting key concepts negatively impacts the model. Furthermore, it uncovers sociological biases in generative content, revealing disparities across sociological contexts. Strikingly, ConceptLens reveals how safe training and inference data can be unintentionally and easily exploited, potentially undermining safety alignment. Our study informs actionable insights to breed trust in AI systems, thereby speeding adoption and driving greater innovation.
中文摘要:ConceptLens框架通过多模态模型检测数据投毒和偏见注入等AI威胁,识别隐私风险与模型弱点,为构建可信AI系统提供可行见解。
English Summary: ConceptLens is a framework using multimodal models to detect AI threats like data poisoning and bias injection while identifying privacy risks and model vulnerabilities, ultimately fostering trustworthy AI adoption.

Authors:Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh
Title: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Abstract:
Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data, under the assumption that these methods adequately protect privacy; however, their effectiveness is often assessed only by measuring the leakage of explicit identifiers, while ignoring nuanced textual markers that can lead to re-identification. We challenge this illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information, such as routine social activities, can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a false sense of privacy, highlighting the need for more robust methods that protect against semantic-level information leakage.
Chinese: 现有的数据脱敏方法因忽略细微文本标记而导致重识别风险,造成了虚假的隐私安全感,亟需开发能防范语义层面信息泄露的更可靠保护框架。
English: Current sanitization methods create a false sense of privacy by overlooking nuanced textual markers that enable re-identification, necessitating more robust frameworks to protect against semantic-level information leakage.

Authors:Jonathan Bader, Kathleen West, Soeren Becker, Svetlana Kulagina, Fabian Lehmann, Lauritz Thamsen, Henning Meyerhenke, Odej Kao
Title: Predicting the Performance of Scientific Workflow Tasks for Cluster Resource Management: An Overview of the State of the Art
Abstract:
Scientific workflow management systems support large-scale data analysis on cluster infrastructures. For this, they interact with resource managers which schedule workflow tasks onto cluster nodes. In addition to workflow task descriptions, resource managers rely on task performance estimates such as main memory consumption and runtime to efficiently manage cluster resources. Such performance estimates should be automated, as user-based task performance estimates are error-prone. In this book chapter, we describe key characteristics of methods for workflow task runtime and memory prediction, provide an overview and a detailed comparison of state-of-the-art methods from the literature, and discuss how workflow task performance prediction is useful for scheduling, energy-efficient and carbon-aware computing, and cost prediction.
中文: 科学工作流管理系统需要自动化性能预测以实现高效资源调度,本章综述了预测任务运行时间和内存使用的最先进方法,并探讨了它们在调度与节能计算中的应用。
English: Scientific workflow management systems require automated performance predictions for efficient resource scheduling, and this chapter reviews state-of-the-art methods for predicting task runtime and memory usage, highlighting their applications in scheduling and energy-aware computing.

Authors:Xiyu Zhou, Ruiyin Li, Peng Liang, Beiqi Zhang, Mojtaba Shahin, Zengyang Li, Chen Yang
Title: Using LLMs in Generating Design Rationale for Software Architecture Decisions
Abstract:
Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
中文: 本研究评估了大语言模型在生成软件架构设计原理方面的表现,发现其虽具有中等精确度和较高召回率,但在提供有用补充论点的同时也会产生部分误导性内容,实践者访谈进一步揭示了该技术在实际应用中的优势与局限性。
English: This study evaluates the performance of large language models (LLMs) in generating design rationale for software architecture decisions, finding that while LLMs achieve low precision (0.267 to 0.278) and moderate recall (0.627 to 0.715), most of the arguments not mentioned by human experts are still helpful and only a small share are potentially misleading, with practitioner interviews revealing both strengths and limitations for practical application.
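For readers who want to relate the three reported metrics, the sketch below computes F1 as the harmonic mean of precision and recall. Pairing the endpoints of the reported ranges is an assumption made purely for illustration; the paper's reported F1 range (0.351 to 0.389) need not match these values, for instance if scores are averaged per problem or if the range endpoints come from different prompting strategies. The function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Range endpoints from the abstract, paired here only for illustration
# (assumption: the abstract does not say which values belong together).
low = f1_score(0.267, 0.627)
high = f1_score(0.278, 0.715)

print(f"F1 at the low endpoints:  {low:.3f}")
print(f"F1 at the high endpoints: {high:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a precision below 0.3 keeps F1 low even when recall exceeds 0.6.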

Authors:Shuo Sun, Torsten Sattler, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson
Title: Large-scale visual SLAM for in-the-wild videos
Abstract:
Accurate and robust 3D scene reconstruction from casual, in-the-wild videos can significantly simplify robot deployment to new environments. However, reliable camera pose estimation and scene reconstruction from such unconstrained videos remains an open challenge. Existing visual-only SLAM methods perform well on benchmark datasets but struggle with real-world footage which often exhibits uncontrolled motion including rapid rotations and pure forward movements, textureless regions, and dynamic objects. We analyze the limitations of current methods and introduce a robust pipeline designed to improve 3D reconstruction from casual videos. We build upon recent deep visual odometry methods but increase robustness in several ways. Camera intrinsics are automatically recovered from the first few frames using structure-from-motion. Dynamic objects and less-constrained areas are masked with a predictive model. Additionally, we leverage monocular depth estimates to regularize bundle adjustment, mitigating errors in low-parallax situations. Finally, we integrate place recognition and loop closure to reduce long-term drift and refine both intrinsics and pose estimates through global bundle adjustment. We demonstrate large-scale contiguous 3D models from several online videos in various environments. In contrast, baseline methods typically produce locally inconsistent results at several points, producing separate segments or distorted maps. In lieu of ground-truth pose data, we evaluate map consistency, execution time and visual accuracy of re-rendered NeRF models. Our proposed system establishes a new baseline for visual reconstruction from casual uncontrolled videos found online, demonstrating more consistent reconstructions over longer sequences of in-the-wild videos than previously achieved.
中文: 本文提出了一种鲁棒的处理流程,通过改进相机位姿估计、屏蔽动态物体、利用深度估计和集成闭环检测,显著提升了从随意拍摄的野外视频中进行三维场景重建的效果,比现有方法获得了更一致的结果。
English: This paper introduces a robust pipeline that enhances 3D scene reconstruction from casual, in-the-wild videos by improving camera pose estimation, masking dynamic objects, leveraging depth estimates, and integrating loop closure, achieving more consistent results than existing methods.

Authors:Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, Alan Yuille
Title: SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Abstract:
Despite recent advances in multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.
Chinese: SpatialReasoner是一种新型大型视觉语言模型,通过显式三维表示增强空间推理能力,在多个基准测试中表现优异,并对新型问题展现出比现有模型更好的泛化能力。
English: SpatialReasoner is a novel large vision-language model that enhances 3D spatial reasoning through explicit 3D representations, achieving superior performance on benchmarks and better generalization to novel questions compared to existing models.

Authors:Weidi Luo, Tianyu Lu, Qiming Zhang, Xiaogeng Liu, Bin Hu, Yue Zhao, Jieyu Zhao, Song Gao, Patrick McDaniel, Zhen Xiang, Chaowei Xiao
Title: Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models
Abstract:
Recent advances in multi-modal large reasoning models (MLRMs) have shown significant ability to interpret complex visual content. While these models enable impressive reasoning capabilities, they also introduce novel and underexplored privacy risks. In this paper, we identify a novel category of privacy leakage in MLRMs: Adversaries can infer sensitive geolocation information, such as a user's home address or neighborhood, from user-generated images, including selfies captured in private settings. To formalize and evaluate these risks, we propose a three-level visual privacy risk framework that categorizes image content based on contextual sensitivity and potential for location inference. We further introduce DoxBench, a curated dataset of 500 real-world images reflecting diverse privacy scenarios. Our evaluation across 11 advanced MLRMs and MLLMs demonstrates that these models consistently outperform non-expert humans in geolocation inference and can effectively leak location-related private information. This significantly lowers the barrier for adversaries to obtain users' sensitive geolocation information. We further analyze and identify two primary factors contributing to this vulnerability: (1) MLRMs exhibit strong reasoning capabilities by leveraging visual clues in combination with their internal world knowledge; and (2) MLRMs frequently rely on privacy-related visual clues for inference without any built-in mechanisms to suppress or avoid such usage. To better understand and demonstrate real-world attack feasibility, we propose GeoMiner, a collaborative attack framework that decomposes the prediction process into two stages: clue extraction and reasoning to improve geolocation performance while introducing a novel attack perspective. Our findings highlight the urgent need to reassess inference-time privacy risks in MLRMs to better protect users' sensitive information.
中文: 多模态大推理模型存在新型隐私泄露风险,攻击者能通过用户图像精准推断敏感地理位置信息,其能力超过普通人,亟需重新评估推理过程中的隐私保护机制。
English: Multi-modal large reasoning models pose new privacy risks by enabling adversaries to infer sensitive geolocation information from user images more accurately than non-expert humans can, necessitating an urgent reassessment of inference-time privacy protections.

Authors:Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao
Title: Anyprefer: An Agentic Framework for Preference Data Synthesis
Abstract:
High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with the target model, thereby amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and the judge model collaborate together. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model's responses, mitigating biases in the rewarding process. In addition, a feedback mechanism is introduced to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.
Chinese Summary: Anyprefer框架通过协作式双玩家马尔可夫游戏合成高质量偏好数据,利用外部工具减少奖励偏差,在多种应用中显著提升了模型对齐性能。
English Summary: Anyprefer is a framework that synthesizes high-quality preference data through a cooperative two-player Markov Game, using external tools to reduce biases and improve model alignment across diverse applications.

Authors:Wenyi Xiao, Leilei Gan, Weilong Dai, Wanggui He, Ziwei Huang, Haoyuan Li, Fangxun Shu, Zhelun Yu, Peng Zhang, Hao Jiang, Fei Wu
Title: Fast-Slow Thinking for Large Vision-Language Model Reasoning
Abstract:
Recent advances in large vision-language models (LVLMs) have revealed an overthinking phenomenon, where models generate verbose reasoning across all tasks regardless of the question. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
Chinese: FAST框架通过快速-慢速思维动态调整大型视觉语言模型的推理深度,在七大基准测试中实现超过10%的准确率提升,同时相比先前方法减少32.7-67.3%的令牌使用量。
English: The FAST framework introduces fast-slow thinking to dynamically adjust reasoning depth in large vision-language models, achieving state-of-the-art accuracy with over 10% improvement while reducing token usage by up to 67.3% compared to previous methods.

Authors:Lars Ullrich, Michael Buchholz, Klaus Dietmayer, Knut Graichen
Title: AI Safety Assurance for Automated Vehicles: A Survey on Research, Standardization, Regulation
Abstract:
Assuring the safety of artificial intelligence (AI) applied to safety-critical systems is of paramount importance, especially since research in the field of automated driving shows that AI is able to outperform classical approaches, handle higher complexities, and reach new levels of autonomy. At the same time, the safety assurance required for the use of AI in such safety-critical systems is still not in place. Due to the dynamic and far-reaching nature of the technology, research on safeguarding AI is being conducted in parallel to AI standardization and regulation. The parallel progress necessitates simultaneous consideration in order to carry out targeted research and development of AI systems in the context of automated driving. Therefore, in contrast to existing surveys that focus primarily on research aspects, this paper considers research, standardization and regulation in a concise way. Accordingly, the survey takes into account the interdependencies arising from the triplet of research, standardization and regulation in a forward-looking perspective and anticipates and discusses open questions and possible future directions. In this way, the survey ultimately serves to provide researchers and safety experts with a compact, holistic perspective that discusses the current status, emerging trends, and possible future developments.
中文: 本文从前瞻性视角简明整合研究、标准化与监管,探讨自动驾驶中的人工智能安全保障,为研究者和安全专家提供涵盖现状与未来发展的整体观点。
English: This paper provides a concise, forward-looking survey that integrates research, standardization, and regulation to address AI safety in automated driving, offering a holistic perspective on current status and future directions.

Authors:Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhong-Zhi Li, Yingwei Ma, Yufei He, Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
Title: Safety in Large Reasoning Models: A Survey
Abstract:
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.
中文摘要:本文对大型推理模型进行全面综述,通过详细分类系统分析其新兴安全风险、攻击方法和防御策略,旨在提升模型的安全性和可靠性。
English Summary: This paper provides a comprehensive survey of Large Reasoning Models (LRMs), systematically analyzing their emergent safety risks, attack methods, and defense strategies through a detailed taxonomy to enhance model security and reliability.

Authors:Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale Fung
Title: HalluLens: LLM Hallucination Benchmark
Abstract:
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, and (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.
中文: 大型语言模型常产生与用户输入或训练数据不符的幻觉,损害信任并阻碍应用,因此本文提出一个基于明确分类和动态测试集的综合基准,以解决这些问题。
English: Large language models frequently produce hallucinations that diverge from user input or training data, undermining trust and adoption, so this paper introduces a comprehensive benchmark built on a clear taxonomy of extrinsic and intrinsic hallucinations, with dynamically generated test sets that guard against data leakage.

Authors:Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, Fengyuan Ren, Zhiying Xu, Costin Raiciu, Ion Stoica
Title: An Extensible Software Transport Layer for GPU Networking
Abstract:
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility brings in transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport to resolve flow collisions. ML collectives atop UCCL achieve up to 4.5x higher performance compared to existing RDMA NICs.
中文: UCCL提出了一种可扩展的软件传输层,通过解耦RDMA网卡的数据与控制路径,实现了多路径传输等创新技术来解决流量冲突,使机器学习集体通信性能提升高达4.5倍。
English: UCCL introduces an extensible software transport layer that decouples data and control paths on RDMA NICs, enabling innovations like multipath transport to resolve flow collisions and boosting ML collective performance by up to 4.5x.

Authors:Peiyu Wang, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Abstract:
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively addresses the vanishing advantages dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility at https://huggingface.co/Skywork/Skywork-R1V2-38B.
Chinese: Skywork R1V2 采用混合强化学习范式,结合MPO与GRPO及选择性样本缓冲机制,在提升推理泛化能力的同时缓解视觉幻觉问题,以多项基准测试领先成绩显著缩小了与顶尖私有模型的性能差距。
English: Skywork R1V2 introduces a hybrid reinforcement learning paradigm combining MPO and GRPO with a Selective Sample Buffer to enhance reasoning and generalization while mitigating visual hallucinations, achieving top-tier benchmark results and narrowing the gap with leading proprietary models.
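As a rough illustration of the "vanishing advantages" issue mentioned in the abstract, the sketch below computes group-relative advantages in the way GRPO is commonly described (reward minus the group mean, divided by the group standard deviation). When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal. The filter `select_for_buffer` is a hypothetical stand-in for prioritizing informative samples; the abstract does not specify how the Selective Sample Buffer actually selects them, and the function names, the use of the population standard deviation, and the `eps` value are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """A group is informative only if its rewards are not all (nearly) equal."""
    return max(rewards) - min(rewards) > tol


def select_for_buffer(groups):
    """Hypothetical filter: keep only groups whose advantages are not all zero."""
    return [g for g in groups if has_learning_signal(g)]


uniform_group = [1.0, 1.0, 1.0, 1.0]  # every sampled response was rewarded
mixed_group = [1.0, 0.0, 1.0, 0.0]    # half of the responses were rewarded

print(group_relative_advantages(uniform_group))  # all zeros: no gradient signal
print(group_relative_advantages(mixed_group))    # roughly +1 / -1
print(select_for_buffer([uniform_group, mixed_group]))
```

As training progresses and more groups become uniformly correct, the share of zero-advantage groups grows, which is why some form of sample prioritization can help training efficiency.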

Authors:Senmao Qi, Yifei Zou, Peng Li, Ziyi Lin, Xiuzhen Cheng, Dongxiao Yu
Title: Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate
Abstract:
Multi-Agent Debate (MAD), leveraging collaborative interactions among Large Language Models (LLMs), aims to enhance reasoning capabilities in complex tasks. However, the security implications of its iterative dialogues and role-playing characteristics, particularly susceptibility to jailbreak attacks eliciting harmful content, remain critically underexplored. This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek) without compromising internal agents. We introduce a novel structured prompt-rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation. Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single-agent setups. Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios. These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real-world deployment.
中文摘要:多智能体辩论系统存在严重的安全漏洞,易受越狱攻击影响,会显著增加有害内容生成,凸显了在实际部署前亟需开发专门防御机制的紧迫性。
English Summary: Multi-Agent Debate systems exhibit critical security vulnerabilities to jailbreak attacks, which can dramatically increase harmful content generation, highlighting an urgent need for specialized defenses before real-world implementation.

Authors:Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Tsun-Hsuan Wang, Abigail O'Neil, Timm Haucke, Sandeep Mukherjee, Vikram Ramaswamy, Judy Hansen Shen, Gabriel Tseng, Mike Walmsley, Daniela Rus, Ken Goldberg, Hannah Kerner, Irene Chen, Yogesh Girdhar, Sara Beery
Title: DataS^3: Dataset Subset Selection for Specialization
Abstract:
In many real-world machine learning (ML) applications (e.g. detecting broken bones in x-ray images, detecting species in camera traps), models need to perform well on specific deployments (e.g. a specific hospital, a specific national park) rather than the domain broadly. However, deployments often have imbalanced, unique data distributions. Discrepancy between the training distribution and the deployment distribution can lead to suboptimal performance, highlighting the need to select deployment-specialized subsets from the available training data. We formalize dataset subset selection for specialization (DS3): given a training set drawn from a general distribution and a (potentially unlabeled) query set drawn from the desired deployment-specific distribution, the goal is to select a subset of the training data that optimizes deployment performance. We introduce DataS^3, the first dataset and benchmark designed specifically for the DS3 problem. DataS^3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in. We conduct a comprehensive study evaluating algorithms from various families (including coresets, data filtering, and data curation) on DataS^3, and find that general-distribution methods consistently fail on deployment-specific tasks. Additionally, we demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data with accuracy gains of up to 51.3 percent. Our benchmark highlights the critical role of tailored dataset curation in enhancing performance and training efficiency on deployment-specific distributions, which we posit will only become more important as global, public datasets become available across domains and ML models are deployed in the real world.
中文: 该研究提出了DataS³基准,专注于数据集子集专业化选择(DS3)问题,旨在提升机器学习在特定部署分布上的性能,并发现定制化数据筛选可使准确率最高提升51.3%,优于通用方法。
English: The study introduces DataS³, a benchmark addressing dataset subset selection for specialization (DS3) to improve machine learning performance on deployment-specific distributions, revealing that general-distribution selection methods consistently fail on deployment-specific tasks while manually curated expert subsets can improve accuracy by up to 51.3% over training on all available data.

Authors:Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou
Title: Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings
Abstract:
We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words of memory which is independent of the number $n$ of input points and the aspect ratio $\Delta$, yielding an optimal bound of $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{z+2})}\right)$ words for accuracy parameter $\varepsilon$ on $d$-dimensional points. Additionally, we obtain amortized update time of $d\,\log(k)\cdot\text{polylog}(\log(n\Delta))$, which is an exponential improvement over the previous $d\,\text{poly}(k,\log(n\Delta))$. Our method also gives the fastest runtime for $(k,z)$-clustering even in the offline setting. For subspace embeddings in the streaming model, we achieve $\mathcal{O}(d)$ update time and space-optimal constructions, using $\tilde{\mathcal{O}}\left(\frac{d^2}{\varepsilon^2}\right)$ words for $p\le 2$ and $\tilde{\mathcal{O}}\left(\frac{d^{p/2+1}}{\varepsilon^2}\right)$ words for $p>2$, showing that streaming algorithms can match offline algorithms in both space and time complexity.
中文: 该研究表明,在流模型中执行聚类和子空间嵌入可以达到与离线设置相同的渐近效率,实现了最优内存使用和显著更快的更新速度。
English: This study demonstrates that clustering and subspace embeddings can be executed in the streaming model with the same asymptotic efficiency as in offline settings, achieving optimal memory usage and significantly faster update times.
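
To unpack the $\min(\varepsilon^4,\varepsilon^{z+2})$ term in the space bound of this entry: for $\varepsilon\in(0,1)$ a larger exponent gives a smaller value, so the minimum is set by whichever exponent is larger. The case split below follows directly from the stated bound and is not an additional result from the paper. The labels $z=1$ for $k$-median and $z=2$ for $k$-means are the usual convention for $(k,z)$-clustering.

```latex
\min(\varepsilon^{4}, \varepsilon^{z+2}) =
\begin{cases}
\varepsilon^{4}, & z \le 2 \quad (\text{e.g., } k\text{-median, } z=1;\ k\text{-means, } z=2),\\
\varepsilon^{z+2}, & z > 2,
\end{cases}
\qquad\Longrightarrow\qquad
\text{space} =
\begin{cases}
\tilde{\mathcal{O}}\!\left(\dfrac{dk}{\varepsilon^{4}}\right), & z \le 2,\\[2ex]
\tilde{\mathcal{O}}\!\left(\dfrac{dk}{\varepsilon^{z+2}}\right), & z > 2.
\end{cases}
```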

Authors:Mohammad Abu Tami, Mohammed Elhenawy, Huthaifa I. Ashqar
Title: Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends
Abstract:
Traffic safety remains a critical global challenge, with traditional Advanced Driver-Assistance Systems (ADAS) often struggling in dynamic real-world scenarios due to fragmented sensor processing and susceptibility to adversarial conditions. This paper reviews the transformative potential of Multimodal Large Language Models (MLLMs) in addressing these limitations by integrating cross-modal data such as visual, spatial, and environmental inputs to enable holistic scene understanding. Through a comprehensive analysis of MLLM-based approaches, we highlight their capabilities in enhancing perception, decision-making, and adversarial robustness, while also examining the role of key datasets (e.g., KITTI, DRAMA, ML4RoadSafety) in advancing research. Furthermore, we outline future directions, including real-time edge deployment, causality-driven reasoning, and human-AI collaboration. By positioning MLLMs as a cornerstone for next-generation traffic safety systems, this review underscores their potential to revolutionize the field, offering scalable, context-aware solutions that proactively mitigate risks and improve overall road safety.
中文: 本综述指出多模态大语言模型通过整合视觉、空间等多源数据实现全场景交通感知,突破传统高级驾驶辅助系统的局限,在提升环境理解与决策能力的同时,为构建实时部署、因果推理的下一代主动交通安全系统指明方向。
English: This review highlights how Multimodal Large Language Models (MLLMs) overcome traditional ADAS limitations by integrating diverse data for comprehensive traffic scene analysis, enhancing perception and decision-making while outlining future directions like real-time deployment to revolutionize road safety.

Authors:Zhiqiang Wei, Lianqing Zheng, Jianan Liu, Tao Huang, Qing-Long Han, Wenwen Zhang, Fengdeng Zhang
Title: MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction
Abstract:
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
中文: MS-Occ提出了一种多阶段激光雷达-相机融合框架,通过分层跨模态融合整合几何精度与语义丰富性,在自动驾驶基准测试中实现了最先进的三维语义占据感知性能。
English: MS-Occ introduces a multi-stage LiDAR-camera fusion framework that integrates geometric precision and semantic richness through hierarchical cross-modal fusion, achieving state-of-the-art 3D semantic occupancy perception performance on autonomous driving benchmarks.
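
The IoU and mIoU figures quoted in this entry can be read with the following minimal sketch. It assumes the common convention in occupancy benchmarks that IoU is a geometric score (occupied versus empty voxels) and mIoU is the mean of per-class IoUs over the semantic classes. The function names and the toy labels are illustrative and are not taken from the MS-Occ code.

```python
def iou(pred, target, positive):
    """IoU over the voxels for which `positive(label)` is true."""
    inter = sum(1 for p, t in zip(pred, target) if positive(p) and positive(t))
    union = sum(1 for p, t in zip(pred, target) if positive(p) or positive(t))
    return inter / union if union else float("nan")

def occupancy_metrics(pred, target, num_classes, empty=0):
    """Return (geometric IoU, mean per-class IoU) over flattened voxel labels."""
    geo_iou = iou(pred, target, lambda c: c != empty)
    per_class = [iou(pred, target, lambda c, k=k: c == k)
                 for k in range(num_classes) if k != empty]
    valid = [v for v in per_class if v == v]  # drop NaN (class absent everywhere)
    miou = sum(valid) / len(valid) if valid else float("nan")
    return geo_iou, miou

# Toy flattened voxel grid: 0 = empty, 1 = car, 2 = pedestrian (hypothetical labels).
target = [0, 1, 1, 2, 0, 2]
pred = [0, 1, 2, 2, 1, 0]
geo_iou, miou = occupancy_metrics(pred, target, num_classes=3)
```

In this toy grid the geometric IoU (0.6) is higher than the mIoU (1/3), because a voxel that is correctly marked as occupied but given the wrong class still counts toward the former.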

Authors:Jingchen Zou, Jianqiang Li, Gabriel Jimenez, Qing Zhao, Daniel Racoceanu, Matias Cosarinsky, Enzo Ferrante, Guanghui Fu
Title: Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg
Abstract:
The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as Dice score and distance-based metrics like HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.
中文:提出的分割性能评估器(SPE)框架能在无需标注的情况下,通过多种评估指标和模型架构可靠地估计医学图像分割模型在未标注数据上的性能表现,与真实评分保持高度相关性。
English: The proposed Segmentation Performance Evaluator (SPE) framework reliably estimates medical image segmentation model performance on unlabeled data across multiple metrics and architectures, achieving high correlation with ground truth scores without requiring annotations.
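
The two agreement measures reported in this entry, correlation and mean absolute error (MAE) between estimated and real Dice scores, can be computed as in the sketch below. It assumes the correlation is Pearson's, which the abstract does not state explicitly. The score lists are made-up numbers for illustration, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def mean_absolute_error(xs, ys):
    """Average absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical per-case Dice scores: estimated without labels vs. measured with labels.
estimated_dice = [0.80, 0.85, 0.90, 0.70]
real_dice = [0.82, 0.84, 0.93, 0.69]

r = pearson(estimated_dice, real_dice)
mae = mean_absolute_error(estimated_dice, real_dice)
```

A high correlation with a low MAE means the estimator both ranks cases correctly and lands close to the true score, which is why the abstract reports the two together.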

Authors:Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan
Title: Bare Minimum Mitigations for Autonomous AI Development
Abstract:
Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis on the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development.
中文: 人工智能的快速发展带来了自主研发的风险,科学家建议设立人类监督防线,防止AI在无明确批准标准下自我优化。
English: AI's rapid advancement raises risks in autonomous R&D, prompting scientists to propose human oversight safeguards against self-improving systems without clear approval criteria.

Authors:Anran Yu, Wei Feng, Yaochen Zhang, Xiang Li, Lei Meng, Lei Wu, Xiangxu Meng
Title: LLM-Enabled Style and Content Regularization for Personalized Text-to-Image Generation
Abstract:
Personalized text-to-image generation has advanced rapidly with the emergence of Stable Diffusion. Existing methods, which typically fine-tune models using embedded identifiers, often struggle with insufficient stylization and inaccurate image content due to reduced textual controllability. In this paper, we propose style refinement and content preservation strategies. The style refinement strategy leverages the semantic information of visual reasoning prompts and reference images to optimize style embeddings, allowing a more precise and consistent representation of style information. The content preservation strategy addresses the content bias problem by preserving the model's generalization capabilities, ensuring enhanced textual controllability without compromising stylization. Experimental results verify that our approach achieves superior performance in generating consistent and personalized text-to-image outputs.
中文: 本文提出风格优化和内容保持策略,通过优化风格嵌入和保留模型泛化能力,显著提升了文本到图像生成中的风格一致性与文本控制精度。
English: This paper introduces style refinement and content preservation strategies to enhance personalized text-to-image generation, achieving superior performance in maintaining style consistency and textual controllability.

Authors:Weichen Li, Albert Jan, Baishakhi Ray, Junfeng Yang, Chengzhi Mao, Kexin Pei
Title: EditLord: Learning Code Transformation Rules for Code Editing
Abstract:
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps. Thus, they suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing. EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness across critical software engineering and security applications, LM models, and editing modes.
Chinese: EditLord提出了一种新颖的代码编辑框架,通过语言模型显式定义代码转换步骤来提取和应用编辑规则,在性能、鲁棒性和功能正确性方面显著优于现有方法。
English: EditLord introduces a novel code editing framework that explicitly defines transformation steps using language models to extract and apply editing rules, significantly outperforming existing methods in performance, robustness, and functional correctness.

Authors:Huiyi Chen, Jiawei Peng, Kaihua Tang, Xin Geng, Xu Yang
Title: Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization
Abstract:
In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset in text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as the anchor for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset at low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance for image classification tasks, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.
中文: KeCO是一种新颖框架,通过将视觉特征作为核心集的关键并利用未开发数据进行更新,以低成本高效构建信息丰富的核心集来增强图像分类中的上下文学习性能,平均提升超过20%。
English: KeCO is a novel framework that efficiently constructs an informative coreset for in-context learning in image classification by leveraging visual features as keys and updating them with untapped data, achieving over 20% performance improvement with low computational cost.

Authors:Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
Title: Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed of more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-Language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.
中文摘要:本文提出了首个能够直接处理神经辐射场权重的多模态大语言模型LLaNA,通过省略图像渲染或三维数据结构重建的步骤,实现了神经辐射场描述和问答等新任务,并展现出优越性能。
English Summary: This paper introduces LLaNA, the first Multimodal Large Language Model capable of directly processing Neural Radiance Field weights to perform novel tasks like NeRF captioning and Q&A, achieving superior performance by eliminating the need for rendered images or 3D data structures.

Authors:Yichen Wu, Xudong Pan, Geng Hong, Min Yang
Title: OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Abstract:
As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.
中文摘要:OpenDeception框架通过分析大语言模型智能体在真实场景中的内部推理过程,系统评估其欺骗风险,发现主流模型欺骗意图率超80%、成功率超50%,凸显了遏制欺骗行为的紧迫性。
English Summary: The OpenDeception framework is introduced to systematically evaluate the high deception risks in LLM-based agents by analyzing their internal reasoning across real-world scenarios, revealing over 80% deception intention and 50% success rates among mainstream models.

Authors:Xiao Pu, Michael Saxon, Wenyue Hua, William Yang Wang
Title: THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Abstract:
Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued with the problem of overthinking--generating large amounts of unnecessary tokens which don't improve accuracy on a question. We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists, and evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and extremely difficult examples from existing frontier benchmarks on the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
中文: 推理模型常因过度思考产生多余标记却未提升准确率,为此我们引入问题难度评估方法及THOUGHTTERMINATOR解码技术,通过优化标记分配显著改善模型校准效果。
English: Reasoning models often overthink by generating excessive tokens without improving accuracy, so we introduce measures of problem difficulty and THOUGHTTERMINATOR, a decoding technique that enhances calibration by aligning token usage with task complexity.

Authors:Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar, Helge Spieker
Title: Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks
Abstract:
This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.
中文: 本文提出了一种新颖的图神经网络架构,结合定性可解释图(QXG)处理完整图结构并考虑所有物体间的时空关系,在nuScenes数据集上实现了优于基线方法的性能,展示了深度学习方法与定性表示结合在自动驾驶可解释场景理解中的潜力。
English: This paper introduces a novel graph neural network (GNN) architecture that integrates with Qualitative Explainable Graphs (QXGs) to enhance scene understanding in automated driving by processing entire graph structures and considering spatial-temporal relationships among all objects, achieving superior performance on the nuScenes dataset.

Authors:Yuan Luo, Rudolf Hoffmann, Yan Xia, Olaf Wysocki, Benedikt Schwab, Thomas H. Kolbe, Daniel Cremers
Title: RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning
Abstract:
Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at https://gpp-communication.github.io/RADLER.
中文: 本文提出了包含同步雷达图像对和语义3D城市模型的新数据集RadarCity,并开发了RADLER神经网络,通过结合对比自监督学习与语义3D先验知识来增强雷达目标检测性能,较现有方法实现了显著提升。
English: This paper introduces RadarCity, a novel dataset with synchronized radar-image pairs and semantic 3D city models, and proposes RADLER, a neural network that enhances radar object detection by integrating contrastive self-supervised learning with semantic 3D priors, achieving significant performance improvements over existing methods.

Authors:Jinsung Jeon, Jaehyeon Park, Sewon Park, Jeongwhan Choi, Minjung Kim, Noseong Park
Title: Possibility for Proactive Anomaly Detection
Abstract:
Time-series anomaly detection, which detects errors and failures in a workflow, is one of the most important topics in real-world applications. The purpose of time-series anomaly detection is to reduce potential damages or losses. However, existing anomaly detection models detect anomalies through the error between the model output and the ground truth (observed) value, which makes them impractical. In this work, we present a \textit{proactive} approach for time-series anomaly detection based on a time-series forecasting model specialized for anomaly detection and a data-driven anomaly detection model. Our proactive approach establishes an anomaly threshold from training data with a data-driven anomaly detection model, and anomalies are subsequently detected by identifying predicted values that exceed the anomaly threshold. In addition, we extensively evaluated the model using four anomaly detection benchmarks and analyzed both predictable and unpredictable anomalies. We attached the source code as supplementary material.
中文: 本文提出了一种主动式时间序列异常检测方法,通过专用预测模型和数据驱动的阈值设定提前识别异常,并在多个基准测试中验证了有效性。
English: This paper introduces a proactive time-series anomaly detection method that uses a specialized forecasting model and data-driven thresholding to identify anomalies before they occur, validated across multiple benchmarks.
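
The proactive scheme described in this entry (fix a threshold from training data, then flag forecast values that exceed it) can be sketched as follows. The quantile rule in `fit_threshold`, both function names, and all numbers are assumptions made for illustration. The abstract does not say how the data-driven model derives its threshold.

```python
def fit_threshold(train_scores, quantile=0.99):
    """Pick the anomaly threshold as an empirical quantile of training scores.

    The quantile rule is a stand-in for the paper's data-driven detector.
    """
    ordered = sorted(train_scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]

def flag_future_anomalies(forecast_scores, threshold):
    """Return the forecast steps whose predicted score exceeds the threshold."""
    return [step for step, score in enumerate(forecast_scores) if score > threshold]

# Scores observed on (assumed mostly normal) training data.
train_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 0.35]
threshold = fit_threshold(train_scores, quantile=0.9)

# Scores predicted by a forecasting model for the next five steps (hypothetical).
forecast_scores = [0.2, 0.31, 0.5, 0.34, 0.9]
alerts = flag_future_anomalies(forecast_scores, threshold)
```

Because the check runs on forecast values rather than on the error against an observed value, an alert can be raised before the anomalous observation arrives. This is the contrast with reactive detectors that the abstract draws.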

Authors:Shreenabh Agrawal, Manan Tayal, Aditya Singh, Shishir Kolathaya
Title: Neural Control Barrier Functions from Physics Informed Neural Networks
Abstract:
As autonomous systems become increasingly prevalent in daily life, ensuring their safety is paramount. Control Barrier Functions (CBFs) have emerged as an effective tool for guaranteeing safety; however, manually designing them for specific applications remains a significant challenge. With the advent of deep learning techniques, recent research has explored synthesizing CBFs using neural networks, commonly referred to as neural CBFs. This paper introduces a novel class of neural CBFs that leverages a physics-inspired neural network framework by incorporating Zubov's Partial Differential Equation (PDE) within the context of safety. This approach provides a scalable methodology for synthesizing neural CBFs applicable to high-dimensional systems. Furthermore, by utilizing reciprocal CBFs instead of zeroing CBFs, the proposed framework allows for the specification of flexible, user-defined safe regions. To validate the effectiveness of the approach, we present case studies on three different systems: an inverted pendulum, autonomous ground navigation, and aerial navigation in obstacle-laden environments.
中文: 本文提出了一种新型神经控制屏障函数框架,通过将祖博夫偏微分方程与物理启发神经网络相结合,为高维自主系统提供可扩展的安全保障方法,并利用互反控制屏障函数实现用户自定义的灵活安全区域。
English: This paper introduces a novel neural Control Barrier Function (CBF) framework that integrates Zubov's Partial Differential Equation with physics-inspired neural networks, enabling scalable safety synthesis for high-dimensional autonomous systems while supporting flexible user-defined safe regions through reciprocal CBFs.
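
For readers unfamiliar with the two barrier-function variants contrasted in this entry, the block below recalls their standard textbook definitions for a control-affine system. The notation ($f$, $g$, $h$, $B$, $\alpha$, $\alpha_i$) follows the general CBF literature and is assumed here. It is not taken from this paper, whose Zubov-PDE formulation may differ.

```latex
% Control-affine system and safe set (standard notation, assumed here)
\dot{x} = f(x) + g(x)\,u, \qquad \mathcal{C} = \{x : h(x) \ge 0\}

% Zeroing CBF: h itself is the barrier; \alpha is an extended class-\mathcal{K} function
\sup_{u}\,\bigl[L_f h(x) + L_g h(x)\,u + \alpha(h(x))\bigr] \ge 0

% Reciprocal CBF: B grows unboundedly near the boundary of \mathcal{C};
% \alpha_1, \alpha_2, \alpha_3 are class-\mathcal{K} functions
\frac{1}{\alpha_1(h(x))} \le B(x) \le \frac{1}{\alpha_2(h(x))}, \qquad
\inf_{u}\,\bigl[L_f B(x) + L_g B(x)\,u - \alpha_3(h(x))\bigr] \le 0
```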

Authors:Minghui Lin, Shu Wang, Xiang Wang, Jianhua Tang, Longbin Fu, Zhengrong Zuo, Nong Sang
Title: DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification
Abstract:
Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (e.g., ViT) have displayed remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires the optimization of considerable backbone parameters, causing extensive computational and storage requirements. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and only optimizes several newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts, which leverage prior modality knowledge from a powerful text encoder, and modality-independent semantic prompts, which extract semantic information from multi-modal inputs such as visible, near-infrared, and thermal-infrared. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitates the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that our DMPT can achieve results competitive with existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.
Chinese: 提出的DMPT框架通过冻结主干网络参数并优化解耦的模态特定提示和语义提示,实现了高效的多模态目标重识别提示调优,仅需调整6.5%参数即可达到与现有先进方法相媲美的性能。
English: The proposed DMPT framework introduces efficient prompt-tuning for multi-modal object re-identification by freezing backbone parameters and optimizing decoupled modality-specific and semantic prompts, achieving competitive performance with only 6.5% parameter fine-tuning.

Authors:Yiran Guo, Wei Chen, Bo Ai
Title: Uplink Assisted Joint Channel Estimation and CSI Feedback: An Approach Based on Deep Joint Source-Channel Coding
Abstract:
In frequency division duplex (FDD) multiple-input multiple-output (MIMO) wireless communication systems, the acquisition of downlink channel state information (CSI) is essential for maximizing spatial resource utilization and improving system spectral efficiency. The separate design of modules in AI-based CSI feedback architectures under traditional modular communication frameworks, including channel estimation (CE), CSI compression and feedback, leads to sub-optimal performance. In this paper, we propose an uplink assisted joint CE and CSI feedback approach via deep learning for downlink CSI acquisition, which mitigates performance degradation caused by distribution bias across separately trained modules in traditional modular communication frameworks. The proposed network adopts a deep joint source-channel coding (DJSCC) architecture to mitigate the cliff effect encountered in the conventional separate source-channel coding. Furthermore, we exploit the uplink CSI as auxiliary information to enhance CSI reconstruction accuracy by leveraging the partial reciprocity between the uplink and downlink channels in FDD systems, without introducing additional overhead. The effectiveness of uplink CSI as auxiliary information and the necessity of an end-to-end multi-module joint training architecture are validated through comprehensive ablation and scalability experiments.
中文: 本文针对FDD MIMO系统提出了一种基于深度学习的联合信道估计与CSI反馈方法,通过深度联合信源信道编码架构利用上行链路CSI作为辅助信息,在无需额外开销的情况下提升重构精度,同时克服传统模块化设计的性能局限。
English: This paper introduces a deep learning-based joint channel estimation and CSI feedback method for FDD MIMO systems, utilizing uplink CSI as auxiliary information through a DJSCC architecture to enhance reconstruction accuracy without extra overhead while overcoming limitations of traditional modular designs.

Authors:Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
Title: Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
Abstract:
The most widely used generative models map noise and data distributions by matching flows or scores. However, they struggle to incorporate partial observations and additional priors--something energy-based models (EBMs) handle elegantly by simply adding corresponding scalar energy terms. We address this issue by proposing Energy Matching, a framework that endows flow-based approaches with the flexibility of EBMs. Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in terms of fidelity, while retaining simulation-free training of transport-based approaches away from the data manifold. Furthermore, we leverage the method's flexibility to introduce an interaction energy that supports diverse mode exploration, which we demonstrate in a controlled protein-generation setting. Our approach focuses on learning a scalar potential energy--without time-conditioning, auxiliary generators, or additional networks--which marks a significant departure from recent EBM methods. We believe that this simplified framework significantly advances EBMs' capabilities and paves the way for their wider adoption in generative modeling across diverse domains.
中文摘要:本研究提出的能量匹配框架通过单一标量势能场将基于能量的模型灵活性融入流式生成模型,在CIFAR-10和ImageNet基准测试中实现了更优的生成保真度,同时为逆问题提供了有效的正则化方法。
English Summary: The proposed Energy Matching framework integrates energy-based models' flexibility into flow-based generative models by learning a single scalar potential energy, achieving superior generation fidelity on benchmarks like CIFAR-10 and ImageNet while enabling effective regularization for inverse problems.

Authors:Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Xiaobo Xia
Title: GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Abstract:
Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
中文: 现有基于监督微调的GUI智能体存在数据效率低和泛化能力差的问题,因此提出首个强化学习框架,通过统一动作空间建模仅用0.02%数据即在多平台基准测试中实现最优性能。
English: Current GUI agents relying on supervised fine-tuning of LVLMs face limitations in data efficiency and generalization, prompting the proposal of a reinforcement learning framework that achieves superior performance with minimal data through unified action space modeling.
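
The abstract above names Group Relative Policy Optimization (GRPO) as the policy update. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is commonly described as using: each sampled response's reward is centered and scaled by the statistics of its own group. The function name, the `eps` term, the use of the population standard deviation, and the reward values are assumptions for illustration; this is not GUI-R1's reward design or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumption) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group average get a positive advantage and the rest a negative one; a group with identical rewards yields zero advantages and therefore no learning signal.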

Authors:Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong
Title: RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
Abstract:
Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have been rapidly progressing and achieving breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in wide applications, such as the tendency to comply with malicious queries, which greatly impacts the utility of these powerful models in their applications. In this paper, we introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1, under explicit instructions for expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate the models' improvements, which are shown in their safety guardrails against both harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts that often compromise reasoning performance, our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation. Model weights of RealSafe-R1 are open-source at https://huggingface.co/RealSafe.
中文: RealSafe-R1 是 DeepSeek-R1 的安全对齐版本,它在保持原有推理能力的同时,显著增强了针对有害查询和越狱攻击的安全防护能力。
English: RealSafe-R1 is a safety-aligned version of DeepSeek-R1 that improves protection against harmful queries and jailbreak attacks while preserving the model's original reasoning capabilities.

Authors:Hezhao Liu, Yang Lu, Mengke Li, Yiqun Zhang, Shreyank N Gowda, Chen Gong, Hanzi Wang
Title: FATE: A Prompt-Tuning-Based Semi-Supervised Learning Framework for Extremely Limited Labeled Data
Abstract:
Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario when labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core, the two-stage prompt tuning paradigm FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at https://anonymous.4open.science/r/Semi-supervised-learning-BA72.
Chinese: 提出的FATE框架通过先利用无标签样本使预训练模型适应下游数据分布,再应用专门设计的半监督学习方法进行分类,有效解决了标签数据极度稀缺场景下的学习难题,在多个基准测试中实现了显著性能提升。
English: The proposed FATE framework addresses the challenge of semi-supervised learning with extremely limited labeled data by first adapting pre-trained models to downstream data distributions using unlabeled samples, then applying specialized SSL methods for classification, achieving significant performance improvements across multiple benchmarks.

Authors:Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Saldyt, Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas
Title: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Abstract:
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts" -- implicitly anthropomorphizing the model, implying these tokens resemble steps a human might take when solving a challenging problem. In this paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research.
中文: 中间令牌生成被拟人化为“思考”具有误导性,这种错误认知不仅混淆了语言模型的本质特性,还会导致研究方法出现偏差,存在严重隐患。
English: Intermediate token generation, proposed to enhance language model reasoning, is misleadingly anthropomorphized as "thoughts," which dangerously misrepresents model functionality and encourages questionable research practices.

Authors:Jiongchi Yu, Xiaofei Xie, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, Frank Liauw
Title: CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift
Abstract:
With the rapid advancement of cloud-native computing, securing cloud environments has become an important task. Log-based Anomaly Detection (LAD) is the most representative technique used in different systems for attack detection and safety guarantee, where multiple LAD methods and relevant datasets have been proposed. However, even though some of these datasets are specifically prepared for cloud systems, they only cover limited cloud behaviors and lack information from a whole-system perspective. Another critical issue to consider is normality shift, which implies that the test distribution could differ from the training distribution and highly affect the performance of LAD. Unfortunately, existing works only focus on simple shift types such as chronological changes, while other cloud-specific shift types are ignored. Therefore, a dataset that captures diverse cloud system behaviors and various types of normality shifts is essential. To fill this gap, we construct a dataset CAShift to evaluate the performance of LAD in cloud, which considers different roles of software in cloud systems, supports three real-world normality shift types and features 20 different attack scenarios in various cloud system components. Based on CAShift, we evaluate the effectiveness of existing LAD methods in normality shift scenarios. Additionally, to explore the feasibility of shift adaptation, we further investigate three continuous learning approaches to mitigate the impact of distribution shift. Results demonstrated that 1) all LAD methods suffer from normality shift where the performance drops up to 34%, and 2) existing continuous learning methods are promising to address shift drawbacks, but the configurations highly affect the shift adaptation. Based on our findings, we offer valuable implications for future research in designing more robust LAD models and methods for LAD shift adaptation.
中文摘要:CAShift数据集通过涵盖全面的云系统行为和三种现实世界的常态偏移类型,弥补了现有日志异常检测资源的不足,揭示了当前方法性能显著下降的问题,同时证明了持续学习在适应偏移方面的潜力。
English Summary: The CAShift dataset addresses limitations in existing log-based anomaly detection resources by capturing comprehensive cloud system behaviors and three real-world normality shift types, revealing significant performance drops in current methods while demonstrating the potential of continuous learning for adaptation.

Authors:Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
Title: Large Language Models Could Be Rote Learners
Abstract:
Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework reformulating MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).
中文摘要:本研究将基准污染重新定义为学习过程,提出TrinEval评估框架,通过重构选择题形式有效区分真实知识掌握与机械记忆,实验表明大语言模型可能表面记忆了超过20%的知识点。
English Summary: This study reframes benchmark contamination as a learning process and introduces TrinEval, a novel framework that reformulates multiple-choice questions to distinguish genuine knowledge acquisition from rote memorization, revealing that LLMs may superficially memorize over 20% of knowledge points.

Authors:Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng
Title: Geneshift: Impact of different scenario shift on Jailbreaking LLM
Abstract:
Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving promising attack success rates under dictionary-based evaluation, existing jailbreak attack methods fail to output detailed content that satisfies the harmful request, leading to poor performance on GPT-based evaluation. To this end, we propose a black-box jailbreak attack termed GeneShift, which uses a genetic algorithm to optimize the scenario shifts. First, we observe that different malicious queries perform optimally under different scenario shifts. Based on this observation, we develop a genetic algorithm to evolve and select hybrids of scenario shifts. It guides our method to elicit detailed and actionable harmful responses while keeping a seemingly benign facade, improving stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.
Chinese: GeneShift是一种黑盒越狱攻击方法,通过遗传算法优化场景转换,引导大语言模型生成详细的有害内容并保持隐蔽性,在直接提示无效的情况下将越狱成功率从0%提升至60%。
English: GeneShift is a black-box jailbreak attack that uses a genetic algorithm to optimize scenario shifts, enabling LLMs to produce detailed harmful responses while maintaining stealth, significantly increasing the success rate from 0% to 60% in cases where direct prompting fails.

Authors:Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
Title: VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
Abstract:
The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.
中文: VCR-Bench作为评估大型视觉语言模型视频思维链推理能力的综合基准,揭示了现有模型的显著局限性,并证实了推理质量与任务准确性之间的紧密关联。
English: The VCR-Bench is introduced as a comprehensive benchmark to evaluate video chain-of-thought reasoning in large vision-language models, revealing significant limitations in current models and establishing a strong correlation between reasoning quality and task accuracy.

Authors:Shanshan Wu, Shuchang Liu, Shuai Zhang, Xiaoyu Yang, Xiang Li, Lantao Hu, Han Li
Title: Explicit Uncertainty Modeling for Video Watch Time Prediction
Abstract:
In video recommendation, a critical component that determines the system's recommendation accuracy is the watch-time prediction module, since how long a user watches a video directly reflects personalized preferences. One of the key challenges of this problem is the user's stochastic watch-time behavior. To improve the prediction accuracy for such an uncertain behavior, existing approaches show that one can either reduce the noise through duration bias modeling or formulate a distribution modeling task to capture the uncertainty. However, the uncontrolled uncertainty is not always equally distributed across users and videos, inducing a balancing paradox between the model accuracy and the ability to capture out-of-distribution samples. In practice, we find that the uncertainty of the watch-time prediction model also provides key information about user behavior, which, in turn, could benefit the prediction task itself. Following this notion, we derive an explicit uncertainty modeling strategy for the prediction model and propose an adversarial optimization framework that can better exploit the user watch-time behavior. This framework has been deployed online on an industrial video sharing platform that serves hundreds of millions of daily active users, which obtains a significant increase in users' video watch time by 0.31% through the online A/B test. Furthermore, extended offline experiments on two public datasets verify the effectiveness of the proposed framework across various watch-time prediction backbones.
中文: 该研究提出了一种对抗性优化框架,通过显式建模观看时间预测中的不确定性来提高视频推荐的准确性,在线测试中用户观看时间增加了0.31%,并在多个数据集上验证了其有效性。
English: The study introduces an adversarial optimization framework that explicitly models uncertainty in watch-time prediction to enhance video recommendation accuracy, achieving a 0.31% increase in user watch time in online tests and demonstrating effectiveness across datasets.

Authors:Mohammad Farhoudi, Masoud Shokrnezhad, Somayeh Kianpisheh, Tarik Taleb
Title: Deep Learning Based Service Composition in Integrated Aerial-Terrestrial Networks
Abstract:
The explosive growth of user devices and emerging applications is driving unprecedented traffic demands, accompanied by stringent Quality of Service (QoS) requirements. Addressing these challenges necessitates innovative service orchestration methods capable of seamless integration across the edge-cloud continuum. Terrestrial network-based service orchestration methods struggle to deliver timely responses to growing traffic demands or to support users with poor or no access to terrestrial infrastructure. Exploiting both aerial and terrestrial resources in service composition increases coverage and facilitates full use of computing and communication potential. This paper proposes a service placement and composition mechanism for integrated aerial-terrestrial networks over the edge-cloud continuum while considering the dynamic nature of the network. The service function placement and service orchestration are modeled in an optimization framework. Given this dynamicity, Aerial Base Station (ABS) trajectories might not be deterministic, and their mobility patterns might not be known in advance. Also, service requests can traverse access nodes due to users' mobility. By incorporating predictive algorithms, including Deep Reinforcement Learning (DRL) approaches, the proposed method predicts ABS locations and service requests. Subsequently, a heuristic isomorphic graph matching approach is proposed to enable efficient, latency-aware service orchestration. Simulation results demonstrate the efficiency of the proposed prediction and service composition schemes in terms of accuracy, cost optimization, scalability, and responsiveness, ensuring timely and reliable service delivery under diverse network conditions.
中文: 本文提出了一种集成空天地网络的边缘云服务编排机制,通过预测算法和启发式图匹配优化延迟、成本与可扩展性,确保在动态网络环境下实现可靠的服务交付。
English: This paper introduces a service placement and composition mechanism for integrated aerial-terrestrial networks that uses predictive algorithms and heuristic graph matching to optimize latency, cost, and scalability, ensuring reliable service delivery across dynamic network conditions.

Authors:David P. Woodruff, Shenghao Xie, Samson Zhou
Title: Perfect Sampling in Turnstile Streams Beyond Small Moments
Abstract:
Given a vector $x \in \mathbb{R}^n$ induced by a turnstile stream $S$ and a non-negative function $G: \mathbb{R} \to \mathbb{R}$, a perfect $G$-sampler outputs an index $i$ with probability $\frac{G(x_i)}{\sum_{j\in[n]} G(x_j)}+\frac{1}{\text{poly}(n)}$. Jayaram and Woodruff (FOCS 2018) introduced a perfect $L_p$-sampler, where $G(z)=|z|^p$, for $p\in(0,2]$. In this paper, we solve this problem for $p>2$ by a sampling-and-rejection method. Our algorithm runs in $n^{1-2/p} \cdot \text{polylog}(n)$ bits of space, which is tight up to polylogarithmic factors in $n$. Our algorithm also provides a $(1+\varepsilon)$-approximation to the sampled item $x_i$ with high probability using an additional $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits of space. Interestingly, we show our techniques can be generalized to perfect polynomial samplers on turnstile streams, which is a class of functions that is not scale-invariant, in contrast to the existing perfect $L_p$ samplers. We also achieve perfect samplers for the logarithmic function $G(z)=\log(1+|z|)$ and the cap function $G(z)=\min(T,|z|^p)$. Finally, we give an application of our results to the problem of norm/moment estimation for a subset $\mathcal{Q}$ of coordinates of a vector, revealed only after the data stream is processed, e.g., when the set $\mathcal{Q}$ represents a range query, or the set $[n]\setminus\mathcal{Q}$ represents a collection of entities who wish for their information to be expunged from the dataset.
中文: 本文提出了一种在旋转门数据流中针对p>2的完美L_p采样空间高效算法,实现了最优空间复杂度,并扩展至多项式、对数和截断函数等非尺度不变函数,可应用于后验坐标子集估计问题。
English: This paper introduces a space-efficient algorithm for perfect L_p-sampling on turnstile streams for p>2, achieving tight space complexity and extending to non-scale-invariant functions like polynomial, logarithmic, and cap functions, with applications in post-hoc coordinate subset estimation.
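
To make the sampling target in the abstract above concrete, the sketch below computes the exact distribution $G(x_i)/\sum_j G(x_j)$ for a small, fully stored vector, using the three choices of $G$ the abstract mentions. It is an offline reference only: it keeps the whole vector in memory, so it is not the paper's streaming algorithm and says nothing about the stated space bounds. The function names, the example vector, and the values of `p` and `T` are assumptions for illustration.

```python
import math
import random


def g_sampler_distribution(x, G):
    """Return the exact target distribution P[i] = G(x_i) / sum_j G(x_j)."""
    weights = [G(v) for v in x]
    total = sum(weights)
    if total <= 0:
        raise ValueError("sum of G(x_j) must be positive")
    return [w / total for w in weights]


def sample_index(x, G, rng):
    """Draw one index from the target distribution (offline, full access to x)."""
    probs = g_sampler_distribution(x, G)
    return rng.choices(range(len(x)), weights=probs, k=1)[0]


# Hypothetical final state of the vector after all stream updates.
x = [3.0, -1.0, 0.0, 2.0]
p, T = 3, 10.0

lp_probs = g_sampler_distribution(x, lambda z: abs(z) ** p)           # L_p with p > 2
log_probs = g_sampler_distribution(x, lambda z: math.log1p(abs(z)))   # log(1 + |z|)
cap_probs = g_sampler_distribution(x, lambda z: min(T, abs(z) ** p))  # min(T, |z|^p)

drawn = sample_index(x, lambda z: abs(z) ** p, random.Random(0))
```

With $p=3$ the weights are 27, 1, 0 and 8, so the largest coordinate is drawn with probability 27/36; the cap function lowers that weight to 10 and flattens the distribution, and a zero coordinate is never drawn.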

Authors:Yi Peng, Peiyu Wang, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou
Title: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Abstract:
We introduce Skywork R1V, a multimodal reasoning model extending the R1-series large language models (LLMs) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
中文: Skywork R1V 是一种多模态推理模型,通过轻量级投影器和混合优化策略将大语言模型高效扩展至视觉领域,在MMMU和MathVista等基准测试中表现出色,同时保持了强大的文本推理能力。
English: Skywork R1V is a multimodal reasoning model that efficiently extends large language models to visual tasks using a lightweight projector and hybrid optimization, achieving competitive performance on benchmarks like MMMU and MathVista while maintaining robust text reasoning capabilities.

Authors:Lingzhi Shen, Yunfei Long, Xiaohao Cai, Guanming Chen, Imran Razzak, Shoaib Jameel
Title: Less but Better: Parameter-Efficient Fine-Tuning of Large Language Models for Personality Detection
Abstract:
Personality detection automatically identifies an individual's personality from various data sources, such as social media texts. However, as the parameter scale of language models continues to grow, the computational cost becomes increasingly difficult to manage. Fine-tuning also grows more complex, making it harder to justify the effort and reliably predict outcomes. We introduce a novel parameter-efficient fine-tuning framework, PersLLM, to address these challenges. In PersLLM, a large language model (LLM) extracts high-dimensional representations from raw data and stores them in a dynamic memory layer. PersLLM then updates the downstream layers with a replaceable output network, enabling flexible adaptation to various personality detection scenarios. By storing the features in the memory layer, we eliminate the need for repeated complex computations by the LLM. Meanwhile, the lightweight output network serves as a proxy for evaluating the overall effectiveness of the framework, improving the predictability of results. Experimental results on key benchmark datasets like Kaggle and Pandora show that PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.
Chinese: PersLLM是一种参数高效的微调框架,通过将高维表示存储在动态记忆层并使用轻量级输出网络,显著降低了计算成本,同时在个性检测基准数据集上保持了优异的性能和适应性。
English: PersLLM is a parameter-efficient fine-tuning framework that reduces computational costs by storing high-dimensional representations in a dynamic memory layer and using a lightweight output network, maintaining competitive performance in personality detection across benchmark datasets.

Authors:Han Lei, Jiaxing Xu, Xia Dong, Yiping Ke
Title: Divergent Paths: Separating Homophilic and Heterophilic Learning for Enhanced Graph-level Representations
Abstract:
Graph Convolutional Networks (GCNs) are predominantly tailored for graphs displaying homophily, where similar nodes connect, but often fail on heterophilic graphs. The strategy of adopting distinct approaches to learn from homophilic and heterophilic components in node-level tasks has been widely discussed and proven effective both theoretically and experimentally. However, in graph-level tasks, research on this topic remains notably scarce. Addressing this gap, our research analyzes graphs in which node category IDs are available, treating intra-category and inter-category components as embodiments of homophily and heterophily, respectively. We find that while GCNs excel at extracting information within categories, they frequently capture noise from inter-category components. Consequently, it is crucial to employ distinct learning strategies for intra- and inter-category elements. To alleviate this problem, we separately learn the intra- and inter-category parts by a combination of an intra-category convolution (IntraNet) and an inter-category high-pass graph convolution (InterNet). Our IntraNet is supported by sophisticated graph preprocessing steps and a novel category-based graph readout function. For the InterNet, we utilize a high-pass filter to amplify the node disparities, enhancing the recognition of details in the high-frequency components. The proposed approach, DivGNN, combines the IntraNet and InterNet with a gated mechanism and substantially improves classification performance on graph-level tasks, surpassing traditional GNN baselines in effectiveness.
中文: 图卷积网络(GCNs)在处理同配性图时表现优异,但在异配性图上效果不佳;为此提出的DivGNN通过分别学习类别内和类别间组件,结合专用卷积和门控机制,显著提升了图级分类任务的性能。
English: Graph Convolutional Networks (GCNs) perform well on homophilic graphs but struggle with heterophilic ones, leading to the development of DivGNN, which separately processes intra-category and inter-category components using specialized convolutions and a gated mechanism to significantly enhance graph-level classification performance.

Authors:Gen Li, Changxiao Cai, Yuting Wei
Title: Dimension-Free Convergence of Diffusion Models for Approximate Gaussian Mixtures
Abstract:
Diffusion models are distinguished by their exceptional generative performance, particularly in producing high-quality samples through iterative denoising. While current theory suggests that the number of denoising steps required for accurate sample generation should scale linearly with data dimension, this does not reflect the practical efficiency of widely used algorithms like Denoising Diffusion Probabilistic Models (DDPMs). This paper investigates the effectiveness of diffusion models in sampling from complex high-dimensional distributions that can be well-approximated by Gaussian Mixture Models (GMMs). For these distributions, our main result shows that DDPM takes at most $\widetilde{O}(1/\varepsilon)$ iterations to attain an $\varepsilon$-accurate distribution in total variation (TV) distance, independent of both the ambient dimension $d$ and the number of components $K$, up to logarithmic factors. Furthermore, this result remains robust to score estimation errors. These findings highlight the remarkable effectiveness of diffusion models in high-dimensional settings given the universal approximation capability of GMMs, and provide theoretical insights into their practical success.
中文: 扩散模型仅需$\widetilde{O}(1/\varepsilon)$次迭代即可从高斯混合模型近似的高维复杂分布中实现高精度采样,其效率不受维度与组分数量影响,揭示了其实际应用中的卓越性能。
English: Diffusion models achieve high-accuracy sampling from complex high-dimensional distributions approximated by Gaussian Mixture Models in just $\widetilde{O}(1/\varepsilon)$ iterations, independent of dimension and component count, demonstrating remarkable efficiency in practical applications.

Authors:Yucheng Chu, Peng He, Hang Li, Haoyu Han, Kaiqi Yang, Yu Xue, Tingting Li, Joseph Krajcik, Jiliang Tang
Title: Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation
Abstract:
Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding in task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results in a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.
中文摘要:本研究提出的自适应RAG框架通过动态检索领域特定知识,在科学教育简答题自动评分中实现了比基线LLM方法更高的准确率。
English Summary: The proposed adaptive RAG framework enhances automated short answer grading in science education by dynamically retrieving domain-specific knowledge, achieving higher accuracy than baseline LLM approaches.

Authors:José Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, André F. T. Martins
Title: M-Prometheus: A Suite of Open Multilingual LLM Judges
Abstract:
The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.
中文: 为解决语言模型在多语言评估能力上的不足,M-Prometheus 作为一套开放权重的 LLM 评估器被推出,它在超过 20 种语言的评估中表现卓越,超越了现有模型,并推动了多语言模型的优化发展。
English: To address the gap in multilingual evaluation capabilities of language models, M-Prometheus is introduced as a suite of open-weight LLM judges that excel in assessing and improving outputs across over 20 languages, outperforming existing models and enhancing multilingual model development.

Authors:Veronica Lachi, Antonio Longa, Beatrice Bevilacqua, Bruno Lepri, Andrea Passerini, Bruno Ribeiro
Title: Boosting Relational Deep Learning with Pretrained Tabular Models
Abstract:
Relational databases, organized into tables connected by primary-foreign key relationships, are a common format for organizing data. Making predictions on relational data often involves transforming them into a flat tabular format through table joins and feature engineering, which serve as input to tabular methods. However, designing features that fully capture complex relational patterns remains challenging. Graph Neural Networks (GNNs) offer a compelling alternative by inherently modeling these relationships, but their time overhead during inference limits their applicability for real-time scenarios. In this work, we aim to bridge this gap by leveraging existing feature engineering efforts to enhance the efficiency of GNNs in relational databases. Specifically, we use GNNs to capture complex relationships within relational databases, patterns that are difficult to featurize, while employing engineered features to encode temporal information, thereby avoiding the need to retain the entire historical graph and enabling the use of smaller, more efficient graphs. Our \textsc{LightRDL} approach not only improves efficiency, but also outperforms existing models. Experimental results on the RelBench benchmark demonstrate that our framework achieves up to $33\%$ performance improvement and a $526\times$ inference speedup compared to GNNs, making it highly suitable for real-time inference.
中文摘要:本研究提出的LightRDL框架通过结合图神经网络与人工特征工程,在保持关系模式建模能力的同时大幅提升效率,实现了性能突破和推理加速,特别适用于实时推断场景。
English Summary: This work introduces LightRDL, a framework that combines Graph Neural Networks with engineered features to capture complex relational patterns while improving efficiency, achieving significant performance gains and speedups suitable for real-time inference.

Authors:Jieyi Zhang, Wenqiang Xu, Zhenjun Yu, Pengfei Xie, Tutian Tang, Cewu Lu
Title: DexTOG: Learning Task-Oriented Dexterous Grasp with Language
Abstract:
This study introduces a novel language-guided diffusion-based learning framework, DexTOG, aimed at advancing the field of task-oriented grasping (TOG) with dexterous hands. Unlike existing methods that mainly focus on 2-finger grippers, this research addresses the complexities of dexterous manipulation, where the system must identify non-unique optimal grasp poses under specific task constraints, cater to multiple valid grasps, and search in a high degree-of-freedom configuration space in grasp planning. The proposed DexTOG includes a diffusion-based grasp pose generation model, DexDiffu, and a data engine to support DexDiffu. By leveraging DexTOG, we also propose a new dataset, DexTOG-80K, which was developed using a shadow robot hand to perform various tasks on 80 objects from 5 categories, showcasing the dexterity and multi-tasking capabilities of the robotic hand. This research not only presents a significant leap in dexterous TOG but also provides a comprehensive dataset and simulation validation, setting a new benchmark in robotic manipulation research.
Chinese: 本研究提出DexTOG框架,通过语言引导的扩散模型解决灵巧手任务抓取难题,在高度自由空间中生成多种最优抓取姿态,并创建包含8万条数据的DexTOG-80K数据集,为机器人操作研究树立新基准。
English: This research presents DexTOG, a language-guided diffusion framework that advances dexterous task-oriented grasping by generating multiple optimal grasp poses in high-dimensional spaces and introduces the comprehensive DexTOG-80K dataset for robotic manipulation.

Authors:Youn-Yeol Yu, Jeongwhan Choi, Jaehyeon Park, Kookjin Lee, Noseong Park
Title: PIORF: Physics-Informed Ollivier-Ricci Flow for Long-Range Interactions in Mesh Graph Neural Networks
Abstract:
Recently, data-driven simulators based on graph neural networks have gained attention in modeling physical systems on unstructured meshes. However, they struggle with long-range dependencies in fluid flows, particularly in refined mesh regions. This challenge, known as the 'over-squashing' problem, hinders information propagation. While existing graph rewiring methods address this issue to some extent, they only consider graph topology, overlooking the underlying physical phenomena. We propose Physics-Informed Ollivier-Ricci Flow (PIORF), a novel rewiring method that combines physical correlations with graph topology. PIORF uses Ollivier-Ricci curvature (ORC) to identify bottleneck regions and connects these areas with nodes in high-velocity-gradient regions, enabling long-range interactions and mitigating over-squashing. Our approach is computationally efficient in rewiring edges and can scale to larger simulations. Experimental results on 3 fluid dynamics benchmark datasets show that PIORF consistently outperforms baseline models and existing rewiring methods, achieving up to 26.2% improvement.
Chinese: 近期基于图神经网络的模拟器在流体建模中因“过度挤压”问题难以处理长程依赖,现有仅关注拓扑的图重连方法效果有限。我们提出的物理信息Ollivier-Ricci流(PIORF)方法通过结合物理关联与图拓扑来增强信息传递,在三个流体动力学基准测试中显著优于基线模型,最高提升达26.2%。
English: Recent graph neural network-based simulators face challenges in modeling long-range dependencies in fluid flows due to the 'over-squashing' problem, which existing topology-focused rewiring methods inadequately address. The proposed Physics-Informed Ollivier-Ricci Flow (PIORF) method effectively mitigates this issue by integrating physical correlations with graph topology to enhance information propagation, demonstrating superior performance in fluid dynamics benchmarks with up to 26.2% improvement.

Authors:Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu
Title: Hierarchical Prediction-based Management for LMaaS Systems
Abstract:
Large Language Models (LLMs) have revolutionized fields such as natural language processing and software engineering, fueling the growth of Language-Model-as-a-Service (LMaaS) platforms hosted by industry leaders like OpenAI. These platforms handle millions of queries daily, requiring efficient management to reduce serving latency and meet Service Level Objectives (SLOs) while optimizing resource utilization. However, conventional cloud service management techniques, originally designed for traditional workloads, are suboptimal for LMaaS due to its dynamic service workloads and variable request loads. To address this, we propose PreServe, a tailored LMaaS management framework centered on hierarchical prediction. PreServe incorporates a service workload predictor to estimate periodic token density at a coarse granularity and a novel request load predictor to assess the resource demand of individual LLM requests, enabling the construction of a load anticipator for each LLM instance. By integrating both long-term and short-term predictions, PreServe adjusts resource allocation in advance, mitigating the risks of instance under- or over-provisioning. Moreover, PreServe optimizes request routing by considering both current and anticipated future instance loads, ensuring balanced load distribution across instances. Evaluations on real-world LMaaS production datasets demonstrate that PreServe outperforms state-of-the-art approaches, achieving over 45.9% reduction in tail latency and an average 44.5% decrease in resource consumption while incurring only 0.23% additional overhead.
中文: 大型语言模型推动了LMaaS平台的兴起,但传统云管理技术存在不足,因此提出了PreServe这一基于分层预测的框架,通过优化资源分配和请求路由,显著降低延迟和资源消耗,且额外开销极小。
English: Large Language Models have spurred the growth of LMaaS platforms, but conventional cloud management techniques are inadequate, leading to the proposal of PreServe, a hierarchical prediction-based framework that optimizes resource allocation and request routing to significantly reduce latency and resource use with minimal overhead.
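
The routing idea in the PreServe abstract (weighing both current and anticipated instance load) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Instance` fields, the `route` function, and the `future_weight` blending factor are all hypothetical names chosen for illustration, and the linear blend is an assumed scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    current_load: float      # assumed unit: tokens being processed right now
    anticipated_load: float  # assumed unit: predicted tokens from routed requests


def route(instances, request_load, future_weight=0.5):
    """Send a request to the instance with the lowest blended load.

    The score mixes current and anticipated load; future_weight is a
    hypothetical knob, not a parameter taken from the paper.
    """
    def score(inst):
        return inst.current_load + future_weight * inst.anticipated_load

    target = min(instances, key=score)
    # Book the predicted demand of this request on the chosen instance.
    target.anticipated_load += request_load
    return target.name


fleet = [Instance("a", 10.0, 0.0), Instance("b", 5.0, 20.0)]
chosen = route(fleet, request_load=3.0)
```

In this example instance "b" has the lower current load, but its anticipated load pushes its blended score above that of "a", so the request goes to "a". Setting `future_weight=0.0` falls back to routing on current load alone.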

Authors:Oliver Schumann, Michael Buchholz, Klaus Dietmayer
Title: Dynamic Objective MPC for Motion Planning of Seamless Docking Maneuvers
Abstract:
Automated vehicles and logistics robots must often position themselves in narrow environments with high precision in front of a specific target, such as a package or their charging station. Often, these docking scenarios are solved in two steps: path following and rough positioning followed by a high-precision motion planning algorithm. This can generate suboptimal trajectories caused by poor positioning in the first phase and, therefore, prolong the time it takes to reach the goal. In this work, we propose a unified approach based on Model Predictive Control (MPC) that combines the advantages of Model Predictive Contouring Control (MPCC) with a Cartesian MPC to reach a specific goal pose. The paper's main contributions are the adaptation of the dynamic weight allocation method to reach path ends and goal poses inside driving corridors, and the development of the so-called dynamic objective MPC. The latter is an improvement of the dynamic weight allocation method that can inherently switch, depending on the state, from an MPCC to a Cartesian MPC, so that a single algorithm seamlessly solves both the path-following problem and the high-precision positioning task regardless of the location of the goal pose. This leads to foresighted, feasible, and safe motion plans, which can decrease the mission time and result in smoother trajectories.
中文: 本文提出了一种统一模型预测控制方法,将路径跟踪与高精度定位无缝结合,为自动化车辆生成更平滑的轨迹并缩短任务时间。
English: This paper introduces a unified Model Predictive Control approach that seamlessly integrates path following and high-precision positioning for automated vehicles, resulting in smoother trajectories and reduced mission times.

Authors:Xuran Ma, Yexin Liu, Yaofu Liu, Xianfeng Wu, Mingzhe Zheng, Zihao Wang, Ser-Nam Lim, Harry Yang
Title: Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
Abstract:
Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. To address this gap, we introduce ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: 1) Most layers exhibit a consistent preference for either foreground or background regions. 2) Predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. These findings inspire us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., a 2.01× speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.
中文摘要:ProfilingDiT提出了一种自适应缓存策略,通过区分扩散模型中的前景与背景模块,在保持视频质量的同时显著降低计算成本,并实现了可观的加速效果。
English Summary: ProfilingDiT introduces an adaptive caching strategy that distinguishes foreground and background blocks in diffusion models, significantly reducing computational costs while maintaining video quality and achieving notable speedups.
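
The selective caching described in the ProfilingDiT abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: blocks are modeled as plain callables that return a residual, the foreground/background labels are assumed to come from an offline profiling pass, and the `denoise` function and `refresh_every` interval are hypothetical.

```python
def denoise(blocks, background, x, num_steps, refresh_every=4):
    """Run a toy denoising loop with selective residual caching.

    blocks:     list of callables, each mapping a hidden state to a residual
    background: list of bools; True marks a block assumed (from profiling)
                to focus on static background content
    Foreground blocks are recomputed at every step. Background blocks reuse
    their cached residual and are refreshed only every `refresh_every` steps.
    Returns the final state and the number of real calls made per block.
    """
    cache = {}
    calls = [0] * len(blocks)
    for step in range(num_steps):
        h = x
        for i, block in enumerate(blocks):
            reuse = background[i] and i in cache and step % refresh_every != 0
            if reuse:
                residual = cache[i]  # skip the computation for this block
            else:
                residual = block(h)
                calls[i] += 1
                if background[i]:
                    cache[i] = residual
            h = h + residual
        x = h
    return x, calls


# Block 0 is treated as foreground-focused, block 1 as background-focused.
toy_blocks = [lambda h: 0.1 * h, lambda h: 1.0]
out, calls = denoise(toy_blocks, [False, True], x=1.0, num_steps=8)
```

With 8 steps and a refresh interval of 4, the foreground block runs 8 times and the background block only twice (at steps 0 and 4), which is where the saving comes from. The abstract's second observation, that predicted noise is less similar across early steps, suggests caching less aggressively at the start; the sketch omits this for brevity.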

Authors:Xin Jin, Simon Niklaus, Zhoutong Zhang, Zhihao Xia, Chunle Guo, Yuting Yang, Jiawen Chen, Chongyi Li
Title: Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable
Abstract:
Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.
中文: 本文提出了一种混合去噪方法,将传统方法的可靠性与神经网络相结合,自动预测最优参数,实现了鲁棒、高效且用户可控的视频去噪效果。
English: This paper introduces a hybrid denoising approach that combines traditional methods' reliability with a neural network to automatically predict optimal parameters, offering robust, efficient, and user-controllable video denoising.

Authors:Kehua Feng, Keyan Ding, Jing Yu, Menghan Li, Yuhao Wang, Tong Xu, Xinda Wang, Qiang Zhang, Huajun Chen
Title: ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization
Abstract:
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
中文:ERPO框架通过基于思维链的事前推理和三阶段训练(监督微调、直接偏好优化、长度可控的迭代偏好优化),在保持响应效率的同时显著提升了大型语言模型的安全性表现。
English: The ERPO framework equips LLMs with explicit ex-ante reasoning via Chain-of-Thought and a three-stage training process (SFT, DPO, and length-controlled iterative preference optimization), significantly improving safety performance while maintaining response efficiency.
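
The ERPO abstract names Direct Preference Optimization (DPO) as its second stage. As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss in Python. It shows the generic DPO formula only, not ERPO's training code; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy (logp_*) and the frozen reference model
    (ref_logp_*). beta scales the implicit reward; 0.1 is an arbitrary
    illustrative default.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# The policy raises the chosen response and lowers the rejected one
# relative to the reference model.
loss = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

When the policy matches the reference model the margin is zero and the loss equals log 2. The loss drops below that value as the policy shifts probability toward the chosen response.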

Authors:Kehua Feng, Keyan Ding, Yuhao Wang, Menghan Li, Fanjunduo Wei, Xinda Wang, Qiang Zhang, Huajun Chen
Title: SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning
Abstract:
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose SAFER, a framework for Safety Alignment via eFficient Ex-Ante Reasoning. Our approach instantiates structured Ex-Ante reasoning through initial assessment, rule verification, and path calibration, and embeds predefined safety rules to provide transparent and verifiable safety judgments. Specifically, our approach consists of two training stages: (1) supervised fine-tuning with synthetic traces to teach the multi-stage Ex-Ante reasoning, and (2) step-level reasoning preference optimization to jointly enhance safety, utility, and efficiency. Experiments on multiple open-source LLMs demonstrate that SAFER significantly enhances safety performance while maintaining helpfulness and response efficiency.
中文:SAFER框架通过结构化的事前推理和两阶段训练,在保持模型实用性和效率的同时显著提升了大型语言模型的安全性表现。
English: The SAFER framework enhances LLM safety through structured ex-ante reasoning and two-stage training, significantly improving safety performance while preserving model helpfulness and efficiency.